Welcome to the 2023 Epigenomic Analysis Canadian Bioinformatics Workshop webpage!
Professor, McGill University
Director of Bioinformatics, Genome Quebec Innovation Centre
Director, Canadian Centre of Computational Genomics
Director, McGill initiative for Computational Medicine
Dr. Bourque's research interests are in comparative and functional genomics, with a special emphasis on applications of next-generation sequencing technologies. His lab develops advanced tools and scalable computational infrastructure to enable large-scale applied research projects.
Distinguished Scientist, BC Cancer
Professor, Department of Microbiology & Immunology
Director, Michael Smith Laboratories
Michael Smith Laboratories
Dr. Hirst’s research focuses on understanding epigenetic dysfunction in cancer and his laboratory develops experimental and computational tools to characterize normal and transformed cell types down to the single cell level. He applies these tools to explore the epigenomic states of normal and transformed cell types to discover and exploit therapeutic vulnerabilities.
Bioinformatics Manager, Data Unit
Canadian Centre of Computational Genomics
He joined the McGill Epigenomic Data Coordination Center in 2012 to tackle challenges related to epigenomics, and has since developed many data management and discovery solutions, including the IHEC Data Portal. Other projects of interest include CanDIG and EpiShare, platforms that make genomic and epigenomic data under controlled access more accessible while maintaining study participants' privacy.
Bioinformatics Manager, Tech Dev Unit
Canadian Centre of Computational Genomics
As a Bioinformatics Specialist in the Research and Development team, Jose Hector is involved in maintaining, documenting, and upgrading the RNA-seq pipelines in GenPipes. He also collaborates in several research projects, mostly focusing on transcriptomics, genome assembly, and epigenomics.
Bioinformatician
Ontario Institute for Cancer Research
Edmund is a bioinformatician within the genome informatics team at OICR, where he provides technical knowledge and expertise in genomic analysis. His main focus is developing pipelines and data wrangling for ICGC-ARGO (International Cancer Genome Consortium - Accelerating Research in Genomic Oncology).
Bioinformatics Analyst, Tech Dev Unit
Canadian Centre of Computational Genomics
As a Bioinformatics Analyst in the TechDev team, Mareike is responsible for maintaining pipelines in GenPipes, testing and developing new pipelines for the community, and responding to user requests. Prior to joining C3G, she worked as both a field and computational biologist, with her doctoral and postdoctoral research focusing on mammalian comparative genomics, phylogenomics, and metagenomics, most often in primates.
Acting Scientific Director
Canadian Bioinformatics Workshops (CBW)
Toronto, ON, CA
Dr. Michelle Brazas is the Associate Director for Adaptive Oncology at the Ontario Institute for Cancer Research (OICR), and acting Scientific Director at Bioinformatics.ca. Previously, Dr. Brazas was the Program Manager for Bioinformatics.ca and a faculty member in Biotechnology at BCIT. Michelle co-founded and runs the Toronto Bioinformatics User Group (TorBUG), now in its 11th season, and plays an active role in the International Society for Computational Biology, where she sits on the Board of Directors and Executive Board.
Program Manager, Bioinformatics.ca
Ontario Institute for Cancer Research
Toronto, ON, Canada
Nia is the Program Manager for Bioinformatics.ca, where she coordinates the Canadian Bioinformatics Workshop Series. Prior to starting at OICR, she completed her M.Sc. in Bioinformatics at the University of Guelph in 2020, then worked there as a bioinformatician studying epigenetic and transcriptomic patterns across maize varieties.
HPC and Bioinformatics Services Manager at Princess Margaret
Cancer Centre, University Health Network
Bioinformatics and HPC Core, UHN
MaRS Centre, PMCRT 11-707
101 College St
Toronto ON M5G 1L7
Zhibin Lu is a senior manager at University Health Network Digital. He is responsible for UHN HPC operations and scientific software. He manages two HPC clusters at UHN, including system administration, user management, and maintenance of bioinformatics tools for HPC4health. He is also skilled in Next-Gen sequence data analysis and has developed and
maintained bioinformatics pipelines at the Bioinformatics and HPC Core. He is a member of the Digital Research Alliance of Canada Bioinformatics National Team and Scheduling National Team.
Note
Some of these files are quite large. If you have issues downloading them, contact us at support@bioinformatics.ca.
Instructions for connecting to and properly using a cloud computing cluster at the CBW are available here
We have made our AWS AMI (Amazon Machine Image) publicly available. To launch your own instance, follow the instructions provided by Amazon on how to launch an EC2 instance from a custom Amazon Machine Image. Please note that you will need an AWS account to proceed, and that you will need to upload the CourseData files yourself.
Here are the details of the AMI:
If you want to create and activate a new AWS account, please follow the instructions provided by Amazon.
Install Miniconda by following the instructions at the official Miniconda site
wget https://github.com/arq5x/bedtools2/releases/download/v2.30.0/bedtools.static.binary
mv bedtools.static.binary bedtools
chmod +x bedtools
wget https://newcontinuum.dl.sourceforge.net/project/bio-bwa/bwa-0.7.17.tar.bz2
tar -jxvf bwa-0.7.17.tar.bz2
cd bwa-0.7.17/
make
#### FastQC
wget https://www.bioinformatics.babraham.ac.uk/projects/fastqc/fastqc_v0.12.1.zip
unzip fastqc_v0.12.1.zip
wget https://github.com/samtools/samtools/releases/download/1.17/samtools-1.17.tar.bz2
tar -jxvf samtools-1.17.tar.bz2
cd samtools-1.17
make
sudo make install
wget https://github.com/broadinstitute/picard/releases/download/3.1.0/picard.jar
pip install deeptools
pip install MACS2
wget http://homer.ucsd.edu/homer/configureHomer.pl
perl ./configureHomer.pl -install
perl ./configureHomer.pl -install hg19
perl ./configureHomer.pl -install hg38
rsync -aP hgdownload.soe.ucsc.edu::genome/admin/exe/linux.x86_64/ ./
wget https://github.com/FelixKrueger/Bismark/archive/refs/tags/v0.24.2.tar.gz
tar -zxf v0.24.2.tar.gz
Code:
Like this! This is the main code to run for the step.
Additionally, each code block will include a header indicating which environment to run the code in, for example:
###Shell###
pwd
###R###
getwd()
pwd (Shell) and getwd() (R) perform the equivalent task in their respective environments.
Note
Important points and considerations will also be raised as so.
Code:
###Shell###
mkdir -p ~/workspace/module123/BWA_index
mkdir -p ~/workspace/module123/alignments
mkdir -p ~/workspace/module123/fastqc
mkdir -p ~/workspace/module123/stats
mkdir -p ~/workspace/module123/peaks
mkdir -p ~/workspace/module123/bigBed
mkdir -p ~/workspace/module123/bigWig
mkdir -p ~/workspace/module123/resources
mkdir -p ~/workspace/module123/diffBind
mkdir -p ~/workspace/module123/edgeR
mkdir -p ~/workspace/module123/qc
mkdir -p ~/workspace/module123/BWA_index
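As an aside, the repeated mkdir calls above can be collapsed into a single command with brace expansion. A minimal sketch using a throwaway path under /tmp (the path is hypothetical, not part of the workshop workspace):

```shell
# Brace expansion: the shell expands {a,b,c} into one argument per item,
# so a single mkdir -p call creates every subdirectory at once.
mkdir -p /tmp/demo_ws/{BWA_index,alignments,fastqc}
ls /tmp/demo_ws
```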
mkdir creates a directory.
-p creates the "parent" if the initial parent directory does not exist.
The same directories can be created in a single command: mkdir -p ~/workspace/module123/{BWA_index,alignments,fastqc,stats,peaks,bigBed,bigWig,resources,diffBind,edgeR,qc}
We will be working with the hg38 (and older hg19) human genome builds, restricted to chr19 of hg38/GRCh38 to speed things up for demonstration purposes.
Code:
###Shell###
wget https://hgdownload.soe.ucsc.edu/goldenPath/hg38/chromosomes/chr19.fa.gz -O ~/workspace/module123/BWA_index/chr19.fa.gz
gunzip ~/workspace/module123/BWA_index/chr19.fa.gz -c > ~/workspace/module123/BWA_index/chr19.fa
Runtime : <1 minute
wget https://hgdownload.soe.ucsc.edu/goldenPath/hg38/chromosomes/chr19.fa.gz -O ~/workspace/module123/BWA_index/chr19.fa.gz
wget is a command to pull online files; here it fetches chr19.fa.gz from the example address.
-O ~/workspace/module123/BWA_index/chr19.fa.gz instructs where and what we would like to save our file as.
What happens without -O? Besides wget, curl is another useful alternative.
gunzip ~/workspace/module123/BWA_index/chr19.fa.gz -c > ~/workspace/module123/BWA_index/chr19.fa
gzip is a data compression format useful for reducing storage impact.
gunzip is the opposite: it decompresses gzipped data.
-c redirects the output to STDOUT. Normally, running gunzip file.tsv.gz will turn it into file.tsv; -c allows us to save the contents elsewhere. We do this so we can compare the size difference.
Compare ls ~/workspace/module123/BWA_index/chr19.fa.gz -ilth vs ls ~/workspace/module123/BWA_index/chr19.fa -ilth and note the size difference.
Note
Prior to any tool usage, it's good practice to explore the tool by invoking its help command (this may vary depending on the tool). This is important, as the same -O flag could have different functions in different tools.
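Returning to compression: a toy gzip round trip on a small made-up file (the /tmp paths are hypothetical) shows that -c writes to stdout and leaves the original untouched, and that repetitive sequence compresses well:

```shell
# Build a 12 kb file of repeated ACGT motifs, compress it, decompress it,
# and confirm the round trip is lossless. gzip -c / gunzip -c write to stdout.
printf 'ACGTACGTACGT%.0s' $(seq 1 1000) > /tmp/demo_seq.txt
gzip -c /tmp/demo_seq.txt > /tmp/demo_seq.txt.gz
gunzip -c /tmp/demo_seq.txt.gz > /tmp/demo_seq.roundtrip.txt
cmp /tmp/demo_seq.txt /tmp/demo_seq.roundtrip.txt && echo "round trip OK"
ls -l /tmp/demo_seq.txt /tmp/demo_seq.txt.gz   # compare the sizes
```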
There are multiple versions of hg38/GRCh38 (such as those with or without ALT contigs and EBV); each version would need its own indices.
Note
A newer version of BWA-MEM does exist; however, we'll be using the older version due to a bug that requires significant memory for indexing.
Code:
###Shell###
mkdir -p ~/workspace/module123/BWA_index
bwa index ~/workspace/module123/BWA_index/chr19.fa 2> ~/workspace/module123/BWA_index/index.log
bwa index ~/workspace/module123/BWA_index/chr19.fa
ls ~/workspace/module123/BWA_index/
Code:
###Shell###
fastq_input=~/CourseData/EPI_data/module123/fastq/MCF10A.Input.chr19.R1.fastq.gz
fastq_h3k4me3=~/CourseData/EPI_data/module123/fastq/MCF10A.H3K4me3.chr19.R1.fastq.gz
mkdir -p ~/workspace/module123/fastqc
fastqc -t 8 ${fastq_input} -o ~/workspace/module123/fastqc > ~/workspace/module123/fastqc/input.log 2>&1
fastqc -t 8 ${fastq_h3k4me3} -o ~/workspace/module123/fastqc > ~/workspace/module123/fastqc/h3k4me3.log 2>&1
fastqc
-t 8 specifies the number of threads we want the program to utilize.
-o specifies our output directory.
### Step 3A: FastQC and interpretation
Code:
###Browser###
http://main.uhn-hpc.ca/module123/fastqc/MCF10A.Input.chr19.R1_fastqc.html
http://main.uhn-hpc.ca/module123/fastqc/MCF10A.H3K4me3.chr19.R1_fastqc.html
Note
The URLs link to the TA's instance; make sure to replace the domain with your own.
Per base sequence quality
Per tile sequence quality
Per sequence quality score
Per base sequence content
Per sequence GC content
Per base N content
Ns are low-confidence calls by the sequencer; we expect very few instances.
Sequence Length Distribution
Sequence duplication levels
Overrepresented sequences
Adapter Content
Code:
###Shell###
ref=~/workspace/module123/BWA_index/chr19.fa
read1=~/CourseData/EPI_data/module123/fastq/MCF10A.Input.chr19.R1.fastq.gz
read2=~/CourseData/EPI_data/module123/fastq/MCF10A.Input.chr19.R2.fastq.gz
sample=MCF10A_input_chr19
bwa mem -M -t 4 ${ref} ${read1} ${read2} 2>~/workspace/module123/alignments/${sample}.alignment.log | samtools view -hbS -o ~/workspace/module123/alignments/${sample}.bam
run time ~2 min
The command can be broken down into the following pseudocode: ALIGNMENT | SAMtoBAM_conversion. The | operator streams the results from the first command into the second.
bwa mem -M -t 4 ${ref} ${read1} ${read2} 2>~/workspace/module123/alignments/alignment.log
| the pipe operator passes the results to the next step: ACTION A | ACTION B | ACTION C (like an assembly line).
bwa mem: while BWA has many alignment algorithms, mem is best suited for efficiently handling >70bp paired-end reads.
-M alters the flagging of chimeric reads. By default, chimeric reads are flagged as SUPPLEMENTARY (partially mapping); the option instead marks them as SECONDARY (multi-mapping). This is needed for GATK/Picard support downstream.
-t 4 specifies the number of threads to use.
2>~/workspace/module123/alignments/alignment.log: output occurs in two streams, 1 (the data) and 2 (debug messages, warnings, and status updates). We redirect stream 2 to a log file.
Note
Logs contain valuable information, helpful for debugging.
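A minimal, tool-free sketch of the two streams (the /tmp path is hypothetical): stdout (stream 1) travels down the pipe, while stderr (stream 2) is captured in a log file.

```shell
# stdout goes through the pipe to tr; stderr (stream 2) goes to the log file.
{ echo "DATA"; echo "status: demo warning" >&2; } 2>/tmp/demo.log | tr '[:upper:]' '[:lower:]'
cat /tmp/demo.log
```

This is exactly the shape of the bwa mem command above: the alignments flow through the pipe into samtools, while the aligner's progress messages land in the log.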
samtools view -hbS -o ~/workspace/module123/alignments/${sample}.bam
samtools view: a tool to read our SAM, BAM, and CRAM files.
-hbS: include the header, output as BAM, input is SAM.
-o specifies what we want to save our file as.
Take a look at the resulting BAM file, and note the header and the body.
What else can samtools do?
Code:
###Shell###
sample=MCF10A_input_chr19
samtools sort -@8 ~/workspace/module123/alignments/${sample}.bam -o ~/workspace/module123/alignments/${sample}.sorted.bam
samtools sort sorts our reads by genome coordinates.
Inspect the result with samtools view | head.
Code:
###Shell###
sample=MCF10A_input_chr19
java -jar /usr/local/picard.jar MarkDuplicates \
I=~/workspace/module123/alignments/${sample}.sorted.bam \
O=~/workspace/module123/alignments/${sample}.sorted.dup_marked.bam \
M=~/workspace/module123/alignments/${sample}.dup_marked.output.log \
ASSUME_SORTED=TRUE \
VALIDATION_STRINGENCY=LENIENT \
> ~/workspace/module123/alignments/${sample}.dup_marked.error.log
java -jar /usr/local/picard.jar MarkDuplicates I=~/workspace/module123/alignments/${sample}.sorted.bam O=~/workspace/module123/alignments/${sample}.sorted.dup_marked.bam M=~/workspace/module123/alignments/${sample}.dup_marked.output.log ASSUME_SORTED=TRUE VALIDATION_STRINGENCY=LENIENT
java -jar /usr/local/picard.jar: produced by the Broad Institute, Picard tools are a toolset for HTS/NGS data.
MarkDuplicates identifies read/cluster duplicates arising from library construction during PCR and from cluster formation during sequencing.
I= and O= are the input and output, respectively.
M= saves our output metrics log.
ASSUME_SORTED=TRUE informs Picard that our input is already coordinate sorted.
VALIDATION_STRINGENCY=LENIENT informs Picard to continue and report problems in the log instead of failing.
Code:
###Shell###
sample=MCF10A_input_chr19
mkdir -p ~/workspace/module123/stats
samtools flagstat ~/workspace/module123/alignments/${sample}.sorted.dup_marked.bam > ~/workspace/module123/stats/${sample}.sorted.dup_marked.flagstat
samtools stats ~/workspace/module123/alignments/${sample}.sorted.dup_marked.bam > ~/workspace/module123/stats/${sample}.sorted.dup_marked.stat
samtools flagstat tallies FLAGS and produces a summarized report.
samtools stats collects various metrics on the BAM file, including flagstats.
###Shell###
ref=~/workspace/module123/BWA_index/chr19.fa
for histone in H3K27ac H3K27me3 H3K4me3 ATAC;
do
read1=~/CourseData/EPI_data/module123/fastq/MCF10A.${histone}.chr19.R1.fastq.gz
read2=~/CourseData/EPI_data/module123/fastq/MCF10A.${histone}.chr19.R2.fastq.gz
sample=MCF10A_${histone}_chr19
echo "aligning ${histone}"
bwa mem -M -t 4 ${ref} ${read1} ${read2} 2> ~/workspace/module123/alignments/${sample}.alignment.log | samtools view -hbS -o ~/workspace/module123/alignments/${sample}.bam
echo "sorting ${histone}"
samtools sort -@8 ~/workspace/module123/alignments/${sample}.bam -o ~/workspace/module123/alignments/${sample}.sorted.bam
echo "dupmarking ${histone}"
java -jar /usr/local/picard.jar MarkDuplicates I=~/workspace/module123/alignments/${sample}.sorted.bam O=~/workspace/module123/alignments/${sample}.sorted.dup_marked.bam M=~/workspace/module123/alignments/${sample}.dup_marked.output.log ASSUME_SORTED=TRUE VALIDATION_STRINGENCY=LENIENT > ~/workspace/module123/alignments/${sample}.dup_marked.error.log
echo "calculating stats ${histone}"
samtools flagstat ~/workspace/module123/alignments/${sample}.sorted.dup_marked.bam > ~/workspace/module123/stats/${sample}.sorted.dup_marked.flagstat
samtools stats ~/workspace/module123/alignments/${sample}.sorted.dup_marked.bam > ~/workspace/module123/stats/${sample}.sorted.dup_marked.stat
done
Code:
###Shell###
ls ~/workspace/module123/alignments/*.bam | grep -v sorted.dup_marked | xargs -I {} sh -c "echo rm {}; rm {}"
ls ~/workspace/module123/alignments/*.bam
ls ~/workspace/module123/alignments/*.bam lists the BAM files in our directory, ~/workspace/module123/alignments. The * in *.bam acts as a wildcard for any string of variable length ending in the suffix bam.
grep -v sorted.dup_marked
grep is a powerful tool that searches and matches text.
-v returns lines that do not match our pattern or regex.
sorted.dup_marked is our pattern, keeping 2/3 of the files out of MCF10A_input_chr19.bam, MCF10A_input_chr19.sorted.bam, and MCF10A_input_chr19.sorted.dup_marked.bam.
xargs -I {} sh -c "echo rm {}; rm {}"
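A toy run of this filter-and-delete pattern on made-up filenames; here echo stands in for the actual rm, so nothing is deleted:

```shell
# grep -v drops lines matching the pattern; xargs turns the survivors into
# arguments for a command (only echoed here, so this is a safe dry run).
printf '%s\n' a.bam a.sorted.bam a.sorted.dup_marked.bam \
  | grep -v sorted.dup_marked \
  | xargs -I {} echo rm {}
```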
xargs receives input and performs an action; think of a for-loop.
-I {} sets the placeholder variable we want to use.
sh -c interprets the provided string in a shell.
echo rm {} echoes the command we are about to perform.
rm {} removes the file.
### TSS +/- 2kb
mkdir -p ~/workspace/module123/qc
wget https://www.bcgsc.ca/downloads/esu/touchdown/hg38v79_genes_tss_2000.bed -O ~/workspace/module123/resources/hg38v79_genes_tss_2000.bed
sort -k1,1 -k2,2n ~/workspace/module123/resources/hg38v79_genes_tss_2000.bed > tmp
mv tmp ~/workspace/module123/resources/hg38v79_genes_tss_2000.bed
###Enhancer liftover
wget https://www.bcgsc.ca/downloads/esu/touchdown/encode_enhancers_liftover.bed -O ~/workspace/module123/resources/encode_enhancers_liftover.bed
###Blacklist
wget https://www.encodeproject.org/files/ENCFF356LFX/@@download/ENCFF356LFX.bed.gz -O ~/workspace/module123/resources/hg38_blacklist.bed.gz
gunzip ~/workspace/module123/resources/hg38_blacklist.bed.gz
hg38v79_genes_tss_2000.bed
encode_enhancers_liftover.bed
ls ~/CourseData/EPI_data/module123/encode_bed
basal.H3K27ac.peak_calls.bed
basal.H3K27me3.peak_calls.bed
basal.H3K4me1.peak_calls.bed
basal.H3K4me3.peak_calls.bed
lp.H3K27ac.peak_calls.bed
lp.H3K27me3.peak_calls.bed
lp.H3K4me1.peak_calls.bed
lp.H3K4me3.peak_calls.bed
luminal.H3K27ac.peak_calls.bed
luminal.H3K27me3.peak_calls.bed
luminal.H3K4me1.peak_calls.bed
luminal.H3K4me3.peak_calls.bed
stromal.H3K27ac.peak_calls.bed
stromal.H3K27me3.peak_calls.bed
stromal.H3K4me1.peak_calls.bed
stromal.H3K4me3.peak_calls.bed
ls ~/CourseData/EPI_data/module123/encode_bigWig
basal.H3K27ac.signal_unstranded.bigWig
basal.H3K27me3.signal_unstranded.bigWig
basal.H3K4me1.signal_unstranded.bigWig
basal.H3K4me3.signal_unstranded.bigWig
lp.H3K27ac.signal_unstranded.bigWig
lp.H3K27me3.signal_unstranded.bigWig
lp.H3K4me1.signal_unstranded.bigWig
lp.H3K4me3.signal_unstranded.bigWig
luminal.H3K27ac.signal_unstranded.bigWig
luminal.H3K27me3.signal_unstranded.bigWig
luminal.H3K4me1.signal_unstranded.bigWig
luminal.H3K4me3.signal_unstranded.bigWig
stromal.H3K27ac.signal_unstranded.bigWig
stromal.H3K27me3.signal_unstranded.bigWig
stromal.H3K4me1.signal_unstranded.bigWig
stromal.H3K4me3.signal_unstranded.bigWig
ls ~/CourseData/EPI_data/module123/fastq
MCF10A.ATAC.chr19.R1.fastq.gz
MCF10A.ATAC.chr19.R2.fastq.gz
MCF10A.H3K27ac.chr19.R1.fastq.gz
MCF10A.H3K27ac.chr19.R2.fastq.gz
MCF10A.H3K27me3.chr19.R1.fastq.gz
MCF10A.H3K27me3.chr19.R2.fastq.gz
MCF10A.H3K4me3.chr19.R1.fastq.gz
MCF10A.H3K4me3.chr19.R2.fastq.gz
MCF10A.Input.chr19.R1.fastq.gz
MCF10A.Input.chr19.R2.fastq.gz
CourseData/EPI_data/module123/triplicates/triplicates.csv
CourseData/EPI_data/module123/triplicates/alignments:
MCF10A_H3K4me3_chr19.CondA.Rep1.bam MCF10A_H3K4me3_chr19.CondB.Rep2.bam MCF10A_input_chr19.CondA.Rep3.bam
MCF10A_H3K4me3_chr19.CondA.Rep1.bam.bai MCF10A_H3K4me3_chr19.CondB.Rep2.bam.bai MCF10A_input_chr19.CondA.Rep3.bam.bai
MCF10A_H3K4me3_chr19.CondA.Rep2.bam MCF10A_H3K4me3_chr19.CondB.Rep3.bam MCF10A_input_chr19.CondB.Rep1.bam
MCF10A_H3K4me3_chr19.CondA.Rep2.bam.bai MCF10A_H3K4me3_chr19.CondB.Rep3.bam.bai MCF10A_input_chr19.CondB.Rep1.bam.bai
MCF10A_H3K4me3_chr19.CondA.Rep3.bam MCF10A_input_chr19.CondA.Rep1.bam MCF10A_input_chr19.CondB.Rep2.bam
MCF10A_H3K4me3_chr19.CondA.Rep3.bam.bai MCF10A_input_chr19.CondA.Rep1.bam.bai MCF10A_input_chr19.CondB.Rep2.bam.bai
MCF10A_H3K4me3_chr19.CondB.Rep1.bam MCF10A_input_chr19.CondA.Rep2.bam MCF10A_input_chr19.CondB.Rep3.bam
MCF10A_H3K4me3_chr19.CondB.Rep1.bam.bai MCF10A_input_chr19.CondA.Rep2.bam.bai MCF10A_input_chr19.CondB.Rep3.bam.bai
CourseData/EPI_data/module123/triplicates/bigWig:
CondA.Rep1.bw CondA.Rep2.bw CondA.Rep3.bw CondB.Rep1.bw CondB.Rep2.bw CondB.Rep3.bw
CourseData/EPI_data/module123/triplicates/peaks:
CondA.Rep1_peaks.narrowPeak CondA.Rep3_peaks.narrowPeak CondB.Rep2_peaks.narrowPeak
CondA.Rep2_peaks.narrowPeak CondB.Rep1_peaks.narrowPeak CondB.Rep3_peaks.narrowPeak
Lab Completed!
Congratulations! You have completed Lab 1!
MACS2 (the peak caller) identifies duplicates via genomic coordinates, which can lead to excessive removal of reads.
Code:
###Shell###
treatment=MCF10A_H3K27ac_chr19
treatment_bam=~/workspace/module123/alignments/${treatment}.sorted.dup_marked.bam
treatment_dedup=~/workspace/module123/alignments/${treatment}.sorted.dup_marked.dedup.bam
samtools view -@4 ${treatment_bam} -bh -q10 -F1028 -o ${treatment_dedup}
treatment=MCF10A_H3K27me3_chr19
treatment_bam=~/workspace/module123/alignments/${treatment}.sorted.dup_marked.bam
treatment_dedup=~/workspace/module123/alignments/${treatment}.sorted.dup_marked.dedup.bam
samtools view -@4 ${treatment_bam} -bh -q10 -F1028 -o ${treatment_dedup}
treatment=MCF10A_H3K4me3_chr19
treatment_bam=~/workspace/module123/alignments/${treatment}.sorted.dup_marked.bam
treatment_dedup=~/workspace/module123/alignments/${treatment}.sorted.dup_marked.dedup.bam
samtools view -@4 ${treatment_bam} -bh -q10 -F1028 -o ${treatment_dedup}
treatment=MCF10A_ATAC_chr19
treatment_bam=~/workspace/module123/alignments/${treatment}.sorted.dup_marked.bam
treatment_dedup=~/workspace/module123/alignments/${treatment}.sorted.dup_marked.dedup.bam
samtools view -@4 ${treatment_bam} -bh -q10 -F1028 -o ${treatment_dedup}
input=MCF10A_input_chr19
input_bam=~/workspace/module123/alignments/${input}.sorted.dup_marked.bam
input_dedup=~/workspace/module123/alignments/${input}.sorted.dup_marked.dedup.bam
samtools view -@4 ${input_bam} -bh -q10 -F1028 -o ${input_dedup}
samtools view -@4 ${input_bam} -bh -q10 -F1028 -o ${input_dedup}
-bh: we would like the output to be in BAM format and to contain the header.
-q10 keeps reads with mapping quality >10.
-F1028 keeps only reads that do not have the UNMAP or DUP flags set.
What is the difference between -f and -F?
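The value 1028 is simply the sum (bitwise OR) of the DUP (1024) and UNMAP (4) flag bits; shell arithmetic makes this easy to check on example FLAG values (the flags 1097 and 99 below are illustrative, not from the workshop data):

```shell
# SAM FLAG bits: 4 = UNMAP, 1024 = DUP; -F1028 excludes reads carrying either.
echo $(( 1024 + 4 ))      # the filter value, 1028
flag=1097                  # 1024 + 64 + 8 + 1: a read marked as a duplicate
echo $(( flag & 1028 ))    # non-zero, so this read would be excluded
flag=99                    # 64 + 32 + 2 + 1: a well-mapped proper pair
echo $(( flag & 1028 ))    # zero, so this read is kept
```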
### Step 2A: Run Peak Caller (narrow)
Code:
###Shell###
mkdir -p ~/workspace/module123/peaks
name=MCF10A_H3K27ac
treatment=~/workspace/module123/alignments/${name}_chr19.sorted.dup_marked.dedup.bam
input=~/workspace/module123/alignments/MCF10A_input_chr19.sorted.dup_marked.dedup.bam
macs2 callpeak -t ${treatment} -c ${input} -f BAMPE -g 58617616 -n ${name} --keep-dup all --outdir ~/workspace/module123/peaks/ --bdg 1> ~/workspace/module123/peaks/${name}.out.log 2> ~/workspace/module123/peaks/${name}.err.log
name=MCF10A_H3K4me3
treatment=~/workspace/module123/alignments/${name}_chr19.sorted.dup_marked.dedup.bam
input=~/workspace/module123/alignments/MCF10A_input_chr19.sorted.dup_marked.dedup.bam
macs2 callpeak -t ${treatment} -c ${input} -f BAMPE -g 58617616 -n ${name} --keep-dup all --outdir ~/workspace/module123/peaks/ --bdg 1> ~/workspace/module123/peaks/${name}.out.log 2> ~/workspace/module123/peaks/${name}.err.log
macs2 callpeak -t ${treatment} -c ${input} -f BAMPE -g 58617616 -n ${name} --keep-dup all --outdir ~/workspace/module123/peaks/ --bdg 1> ~/workspace/module123/peaks/${name}.out.log 2> ~/workspace/module123/peaks/${name}.err.log
macs2 callpeak: general-purpose peak calling mode.
-t ${treatment}: treatment file; more than one can be provided to "pool".
-c ${input}: control/input file.
-f BAMPE instructs MACS2 on what kind of file to expect: single/paired-end BED/BAM.
-g sets the appropriate genome size for background sampling. Typically one would set -g hs (human) or -g mm (mouse), but in our case we use the hg38 size of chr19.
-n ${name}: name or prefix to use.
-q 0.05: FDR q-value cutoff (the default).
--outdir ~/workspace/module123/peaks/: where to output files; otherwise MACS2 stores them in the current working directory.
--bdg outputs the pileup as a bedGraph (a BED file where the fourth column is pileup/fragment count/coverage).
1> ~/workspace/module123/peaks/${name}.out.log: output log.
2> ~/workspace/module123/peaks/${name}.err.log: error log.
Code:
###Shell###
mkdir -p ~/workspace/module123/peaks
name=MCF10A_H3K27me3
treatment=~/workspace/module123/alignments/${name}_chr19.sorted.dup_marked.dedup.bam
input=~/workspace/module123/alignments/MCF10A_input_chr19.sorted.dup_marked.dedup.bam
macs2 callpeak -t ${treatment} -c ${input} -f BAMPE -g 58617616 -n ${name} --keep-dup all --outdir ~/workspace/module123/peaks/ --bdg --broad 1> ~/workspace/module123/peaks/${name}.out.log 2> ~/workspace/module123/peaks/${name}.err.log
macs2 callpeak -t ${treatment} -c ${input} -f BAMPE -g 58617616 -n ${name} --keep-dup all --outdir ~/workspace/module123/peaks/ --bdg --broad 1> ~/workspace/module123/peaks/${name}.out.log 2> ~/workspace/module123/peaks/${name}.err.log
--broad is for broad marks; it stitches small peaks together.
broadPeak vs gappedPeak: gappedPeak has additional columns starting from the strand column; most of the differences provide visualization support for the narrow peaks within the broadPeak.
We convert BAM to BED to easily manipulate coordinates.
Note
This step can also be done for ChIP-seq.
Code:
name=MCF10A_ATAC
dedup=~/workspace/module123/alignments/${name}_chr19.sorted.dup_marked.dedup.bam
nsort=~/workspace/module123/alignments/${name}_chr19.nsorted.dup_marked.dedup.bam
tags=~/workspace/module123/peaks/${name}_chr19.frags.bed
samtools sort -@ 8 -n ${dedup} -o ${nsort}
bedtools bamtobed -bedpe -mate1 -i ${nsort} > ${tags}
samtools sort -@ 8 -n ${dedup} -o ${nsort}
-n sorts according to read name rather than by coordinates (the default).
bedtools bamtobed -bedpe -mate1 -i ${nsort} > ${tags}
bedtools bamtobed converts our BAM to BED/coordinates for our reads.
-bedpe instructs bedtools to report read pairs instead of each read individually.
-mate1 instructs bedtools to report read1 information first, followed by read2.
The columns are: chrR1, startR1, endR1, chrR2, startR2, endR2, readName, mapQ, strandR1, strandR2, e.g.:
chr19 4534039 4534190 chr19 4534110 4534248 SRR20814384.28 60 + -
Note
Tn5 shift can be skipped if one is not interested in footprinting.
Code:
###Shell###
name=MCF10A_ATAC
tags=~/workspace/module123/peaks/${name}_chr19.frags.bed
tn5_tags=~/workspace/module123/peaks/${name}_chr19.frags.tn5.bed
cat ${tags} \
| awk 'BEGIN{OFS="\t"}{printf "%s\t%s\t%s\tN\t1000\t%s\n%s\t%s\t%s\tN\t1000\t%s\n",$1,$2,$3,$9,$4,$5,$6,$10}' \
| awk 'BEGIN{OFS="\t"} {if ($6 == "+") {$2 = $2 + 4} else if ($6 == "-") {$3 = $3 - 5} if ($2 >= $3) { if ($6 == "+") {$2 = $3 - 1} else {$3 = $2 + 1} } print $0}' \
> ${tn5_tags}
cat ${tags} | awk 'BEGIN{OFS="\t"}{printf "%s\t%s\t%s\tN\t1000\t%s\n%s\t%s\t%s\tN\t1000\t%s\n",$1,$2,$3,$9,$4,$5,$6,$10}' | awk 'BEGIN{OFS="\t"} {if ($6 == "+") {$2 = $2 + 4} else if ($6 == "-") {$3 = $3 - 5} if ($2 >= $3) { if ($6 == "+") {$2 = $3 - 1} else {$3 = $2 + 1} } print $0}' > ${tn5_tags}
The command can be read as: read tagFile | rearrange columns | shift tag depending on strand.
awk '
BEGIN{OFS="\t"}
{
  printf "%s\t%s\t%s\tN\t1000\t%s\n%s\t%s\t%s\tN\t1000\t%s\n",
  $1,$2,$3,$9,$4,$5,$6,$10
}'
BEGIN{OFS="\t"} sets the output field separator to tab.
printf "%s\t%s\t%s\tN\t1000\t%s\n...",$1,$2,... prints a format string where each %s is substituted with the next specified element.
In effect, it prints chrR1, startR1, endR1, "N", "1000", strandR1 on one line, then chrR2, startR2, endR2, "N", "1000", strandR2 on the next.
awk '
BEGIN{OFS="\t"}
{
  if ($6 == "+")
    {$2 = $2 + 4}
  else if ($6 == "-")
    {$3 = $3 - 5}
  if ($2 >= $3)
  {
    if ($6 == "+")
      {$2 = $3 - 1}
    else
      {$3 = $2 + 1}
  }
  print $0
}'
> ${tn5_tags}
Tags on the + strand are shifted 4 bp to the right; tags on the - strand are shifted 5 bp to the left.
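A toy run of the two awk stages on a single made-up BEDPE fragment (the coordinates and read name are invented) shows the column reshuffle followed by the +4/-5 Tn5 shift:

```shell
# One BEDPE fragment: R1 on the + strand, its mate on the - strand.
# Stage 1 emits one 6-column line per mate; stage 2 applies the Tn5 shift.
printf 'chr19\t100\t200\tchr19\t150\t250\tread1\t60\t+\t-\n' \
  | awk 'BEGIN{OFS="\t"}{printf "%s\t%s\t%s\tN\t1000\t%s\n%s\t%s\t%s\tN\t1000\t%s\n",$1,$2,$3,$9,$4,$5,$6,$10}' \
  | awk 'BEGIN{OFS="\t"}{if ($6 == "+") {$2 = $2 + 4} else if ($6 == "-") {$3 = $3 - 5} print $0}'
```

The + mate's start moves 100 to 104, and the - mate's end moves 250 to 245 (the overlap-correction branch is omitted here for brevity).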
Some pipelines shift 4/4 instead of 4/5.
Code:
###Shell###
name=MCF10A_ATAC
tn5_tags=~/workspace/module123/peaks/${name}_chr19.frags.tn5.bed
macs2 callpeak \
-t ${tn5_tags} \
-f BED \
-n ${name} \
-g 58617616 \
-p 0.01 \
--shift -100 \
--extsize 200 \
--nomodel \
--bdg \
--keep-dup all \
--outdir ~/workspace/module123/peaks/
macs2 callpeak -t /home/ubuntu/workspace/module123/peaks/MCF10A_ATAC_chr19.frags.tn5.bed -f BED -n ${name} -g 58617616 -p 0.01 --nomodel --extsize 200 --shift -100 --bdg --keep-dup all
-f BED: we specify a BED-format input instead of BAM.
-n ${name}: name of the output files.
-g 58617616: size of chr19.
-p 0.01: p-value cutoff.
--nomodel disables model building, which is on by default and normally calculates the extsize and shift parameters.
--extsize 200: because Tn5's activity is at the 5' end, we extend the 5' end to get more representation.
--shift -100: should be -1*(extsize/2); this focuses MACS2 on our 5' ends.
--bdg generates a pileup bedGraph.
--keep-dup all retains "duplicates". As we've already filtered out duplicates, this prevents MACS2 from calling duplicates via genomic coordinates.
--call-summits
Code:
###Shell###
wget https://www.encodeproject.org/files/ENCFF356LFX/@@download/ENCFF356LFX.bed.gz -O ~/workspace/module123/resources/hg38_blacklist.bed.gz
gunzip ~/workspace/module123/resources/hg38_blacklist.bed.gz
blacklist=~/workspace/module123/resources/hg38_blacklist.bed
sample="MCF10A_H3K27ac_peaks"
bedtools intersect -v -a ~/workspace/module123/peaks/${sample}.narrowPeak -b ${blacklist} > ~/workspace/module123/peaks/${sample}.blacklistRemoved.narrowPeak
bedtools intersect -u -a ~/workspace/module123/peaks/${sample}.narrowPeak -b ${blacklist} > ~/workspace/module123/peaks/${sample}.blacklist.narrowPeak
sample="MCF10A_H3K4me3_peaks"
bedtools intersect -v -a ~/workspace/module123/peaks/${sample}.narrowPeak -b ${blacklist} > ~/workspace/module123/peaks/${sample}.blacklistRemoved.narrowPeak
bedtools intersect -u -a ~/workspace/module123/peaks/${sample}.narrowPeak -b ${blacklist} > ~/workspace/module123/peaks/${sample}.blacklist.narrowPeak
sample="MCF10A_H3K27me3_peaks"
bedtools intersect -v -a ~/workspace/module123/peaks/${sample}.broadPeak -b ${blacklist} > ~/workspace/module123/peaks/${sample}.blacklistRemoved.broadPeak
bedtools intersect -u -a ~/workspace/module123/peaks/${sample}.broadPeak -b ${blacklist} > ~/workspace/module123/peaks/${sample}.blacklist.broadPeak
sample="MCF10A_ATAC_peaks"
bedtools intersect -v -a ~/workspace/module123/peaks/${sample}.narrowPeak -b ${blacklist} > ~/workspace/module123/peaks/${sample}.blacklistRemoved.narrowPeak
bedtools intersect -u -a ~/workspace/module123/peaks/${sample}.narrowPeak -b ${blacklist} > ~/workspace/module123/peaks/${sample}.blacklist.narrowPeak
bedtools intersect -u -a ~/workspace/module123/peaks/${sample}.narrowPeak -b ${blacklist}
bedtools intersect identifies elements from fileA and fileB that overlap, genomic-coordinate-wise.
-u: by default, bedtools outputs an element every time an intersection is detected, i.e. if fileA_ele1 overlaps both fileB_ele1 and fileB_ele2 it is reported twice; -u instead reports the element once, regardless of how many overlaps.
bedtools intersect -v -a ~/workspace/module123/peaks/${sample}.narrowPeak -b ${blacklist}
-v reverses the behaviour, identifying elements that do not overlap.
Code:
###Shell###
mkdir -p ~/workspace/module123/{bigBed,bigWig}
wget https://hgdownload.cse.ucsc.edu/goldenpath/hg38/bigZips/hg38.chrom.sizes -O ~/workspace/module123/resources/hg38.chrom.sizes
sample="MCF10A_H3K27ac"
chrom_sizes=~/workspace/module123/resources/hg38.chrom.sizes
input_bedgraph=~/workspace/module123/peaks/${sample}_treat_pileup.bdg
output_bigwig=~/workspace/module123/bigWig/${sample}_treat_pileup.bw
sort -k1,1 -k2,2n ${input_bedgraph} > ~/workspace/module123/bigWig/tmp
bedGraphToBigWig ~/workspace/module123/bigWig/tmp ${chrom_sizes} ${output_bigwig}
rm ~/workspace/module123/bigWig/tmp
input_bedgraph=~/workspace/module123/peaks/MCF10A_H3K27me3_control_lambda.bdg
output_bigwig=~/workspace/module123/bigWig/MCF10A_Input_control_pileup.bw
sort -k1,1 -k2,2n ${input_bedgraph} > ~/workspace/module123/bigWig/tmp
bedGraphToBigWig ~/workspace/module123/bigWig/tmp ${chrom_sizes} ${output_bigwig}
rm ~/workspace/module123/bigWig/tmp
sort -k1,1 -k2,2n ${input_bedgraph} sorts the bedGraph file.
-k1,1 sorts first by chromosome, alphabetically.
-k2,2n sorts secondarily by genomic coordinate, numerically.
bedGraphToBigWig converts the bedGraph file to a bigWig file.
A bedGraph is an extension of BED (genomic coordinates) plus a column with a numeric value.
wig file
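The effect of the two sort keys can be seen on a toy two-column fragment (made-up chromosomes and coordinates): a purely lexical sort would place 100 before 20, while -k2,2n compares starts numerically.

```shell
# -k1,1 sorts chromosomes lexically; -k2,2n sorts start positions numerically,
# so chr1:20 correctly precedes chr1:100.
printf 'chr2\t50\nchr1\t100\nchr1\t20\n' | sort -k1,1 -k2,2n
```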
Code:
###Shell###
sample="MCF10A_H3K27me3"
chrom_sizes=~/workspace/module123/resources/hg38.chrom.sizes
input_bedgraph=~/workspace/module123/peaks/${sample}_treat_pileup.bdg
output_bigwig=~/workspace/module123/bigWig/${sample}_treat_pileup.bw
sort -k1,1 -k2,2n ${input_bedgraph} > ~/workspace/module123/bigWig/tmp
bedGraphToBigWig ~/workspace/module123/bigWig/tmp ${chrom_sizes} ${output_bigwig}
rm ~/workspace/module123/bigWig/tmp
sample="MCF10A_H3K4me3"
chrom_sizes=~/workspace/module123/resources/hg38.chrom.sizes
input_bedgraph=~/workspace/module123/peaks/${sample}_treat_pileup.bdg
output_bigwig=~/workspace/module123/bigWig/${sample}_treat_pileup.bw
sort -k1,1 -k2,2n ${input_bedgraph} > ~/workspace/module123/bigWig/tmp
bedGraphToBigWig ~/workspace/module123/bigWig/tmp ${chrom_sizes} ${output_bigwig}
rm ~/workspace/module123/bigWig/tmp
sample="MCF10A_ATAC"
chrom_sizes=~/workspace/module123/resources/hg38.chrom.sizes
input_bedgraph=~/workspace/module123/peaks/${sample}_treat_pileup.bdg
output_bigwig=~/workspace/module123/bigWig/${sample}_treat_pileup.bw
sort -k1,1 -k2,2n ${input_bedgraph} > ~/workspace/module123/bigWig/tmp
bedGraphToBigWig ~/workspace/module123/bigWig/tmp ${chrom_sizes} ${output_bigwig}
rm ~/workspace/module123/bigWig/tmp
BED files

Code:
###Shell###
mkdir -p ~/workspace/module123/{bigBed,bigWig}
wget https://hgdownload.cse.ucsc.edu/goldenpath/hg38/bigZips/hg38.chrom.sizes -O ~/workspace/module123/resources/hg38.chrom.sizes
sample="MCF10A_H3K27ac"
chrom_sizes=~/workspace/module123/resources/hg38.chrom.sizes
input_bed=~/workspace/module123/peaks/${sample}_peaks.blacklistRemoved.narrowPeak
output_bigwig=~/workspace/module123/bigBed/${sample}.blacklistRemoved.bb
sort -k1,1 -k2,2n ${input_bed} | cut -f1-3 > ~/workspace/module123/bigBed/tmp
bedToBigBed ~/workspace/module123/bigBed/tmp ${chrom_sizes} ${output_bigwig}
rm ~/workspace/module123/bigBed/tmp
- `sort -k1,1 -k2,2n ${input_bed}` sorts the BED file
- `-k1,1` sort first by chromosome, alphabetically
- `-k2,2n` sort secondarily by start coordinate, numerically
- `cut -f1-3` keeps only the first three columns (chromosome, start, end)
- `bedToBigBed` converts the BED file to a bigBed file

Code:
###Shell###
sample="MCF10A_ATAC"
chrom_sizes=~/workspace/module123/resources/hg38.chrom.sizes
input_bed=~/workspace/module123/peaks/${sample}_peaks.blacklistRemoved.narrowPeak
output_bigwig=~/workspace/module123/bigBed/${sample}.blacklistRemoved.bb
sort -k1,1 -k2,2n ${input_bed} | cut -f1-3 > ~/workspace/module123/bigBed/tmp
bedToBigBed ~/workspace/module123/bigBed/tmp ${chrom_sizes} ${output_bigwig}
rm ~/workspace/module123/bigBed/tmp
sample="MCF10A_H3K4me3"
chrom_sizes=~/workspace/module123/resources/hg38.chrom.sizes
input_bed=~/workspace/module123/peaks/${sample}_peaks.blacklistRemoved.narrowPeak
output_bigwig=~/workspace/module123/bigBed/${sample}.blacklistRemoved.bb
sort -k1,1 -k2,2n ${input_bed} | cut -f1-3 > ~/workspace/module123/bigBed/tmp
bedToBigBed ~/workspace/module123/bigBed/tmp ${chrom_sizes} ${output_bigwig}
rm ~/workspace/module123/bigBed/tmp
sample="MCF10A_H3K27me3"
chrom_sizes=~/workspace/module123/resources/hg38.chrom.sizes
input_bed=~/workspace/module123/peaks/${sample}_peaks.blacklistRemoved.broadPeak
output_bigwig=~/workspace/module123/bigBed/${sample}.blacklistRemoved.bb
sort -k1,1 -k2,2n ${input_bed} | cut -f1-3 > ~/workspace/module123/bigBed/tmp
bedToBigBed ~/workspace/module123/bigBed/tmp ${chrom_sizes} ${output_bigwig}
rm ~/workspace/module123/bigBed/tmp
Code :
###Shell###
mkdir -p ~/workspace/module123/qc
wget https://www.bcgsc.ca/downloads/esu/touchdown/hg38v79_genes_tss_2000.bed -O ~/workspace/module123/resources/hg38v79_genes_tss_2000.bed
sort -k1,1 -k2,2n ~/workspace/module123/resources/hg38v79_genes_tss_2000.bed > tmp
mv tmp ~/workspace/module123/resources/hg38v79_genes_tss_2000.bed
wget https://www.bcgsc.ca/downloads/esu/touchdown/encode_enhancers_liftover.bed -O ~/workspace/module123/resources/encode_enhancers_liftover.bed
TSS=~/workspace/module123/resources/hg38v79_genes_tss_2000.bed
ENH=~/workspace/module123/resources/encode_enhancers_liftover.bed
sample="MCF10A_H3K27ac"
query_bam=~/workspace/module123/alignments/${sample}_chr19.sorted.dup_marked.bam
samtools view -@4 -q 10 -F 1028 $query_bam -c
samtools view -@4 -q 10 -F 1028 $query_bam -L ${TSS} -c
samtools view -@4 -q 10 -F 1028 $query_bam -L ${ENH} -c
samtools view -@4 -q 10 -F 1028 $query_bam -L ~/workspace/module123/peaks/${sample}_peaks.blacklistRemoved.narrowPeak -c
- `samtools view` parses through our BAM file
- `-@4` number of threads to use
- `-q 10` filter for reads with mapping quality (MAPQ) > 10
- `-F 1028` remove reads with the following flags set: UNMAP, DUP
- `$query_bam` the BAM of interest
- `-c` count the number of reads that fulfil the criteria
- `-L` only consider reads that fall within the specified genomic coordinates (BED file)

Code:
###Shell###
TSS=~/workspace/module123/resources/hg38v79_genes_tss_2000.bed
ENH=~/workspace/module123/resources/encode_enhancers_liftover.bed
for histone in H3K27ac H3K27me3 H3K4me3 ATAC input;
do
sample="MCF10A_${histone}"
query_bam=~/workspace/module123/alignments/${sample}_chr19.sorted.dup_marked.bam
samtools view -@4 -q 10 -F 1028 $query_bam -c > ~/workspace/module123/qc/col_${histone}
samtools view -@4 -q 10 -F 1028 $query_bam -L ${TSS} -c >> ~/workspace/module123/qc/col_${histone}
samtools view -@4 -q 10 -F 1028 $query_bam -L ${ENH} -c >> ~/workspace/module123/qc/col_${histone}
if [[ "$histone" == "H3K27me3" ]]; then
peaks=~/workspace/module123/peaks/${sample}_peaks.blacklistRemoved.broadPeak
samtools view -@4 -q 10 -F 1028 $query_bam -L ~/workspace/module123/peaks/${sample}_peaks.blacklistRemoved.broadPeak -c >> ~/workspace/module123/qc/col_${histone}
elif [[ "$histone" == "H3K27ac" || "$histone" == "H3K4me3" || "$histone" == "ATAC" ]]; then
peaks=~/workspace/module123/peaks/${sample}_peaks.blacklistRemoved.narrowPeak
samtools view -@4 -q 10 -F 1028 $query_bam -L ~/workspace/module123/peaks/${sample}_peaks.blacklistRemoved.narrowPeak -c >> ~/workspace/module123/qc/col_${histone}
else
echo
fi
done
paste ~/workspace/module123/qc/col_H3K27ac ~/workspace/module123/qc/col_H3K27me3 ~/workspace/module123/qc/col_H3K4me3 ~/workspace/module123/qc/col_ATAC ~/workspace/module123/qc/col_input > ~/workspace/module123/qc/integers.tsv
paste \
<(awk '{print $1/442829*100}' ~/workspace/module123/qc/col_H3K27ac) \
<(awk '{print $1/1610910*100}' ~/workspace/module123/qc/col_H3K27me3) \
<(awk '{print $1/1512352*100}' ~/workspace/module123/qc/col_H3K4me3) \
<(awk '{print $1/2751833*100}' ~/workspace/module123/qc/col_ATAC) \
<(awk '{print $1/1023265*100}' ~/workspace/module123/qc/col_input) > ~/workspace/module123/qc/percentages.tsv
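As a sanity check on the `-F 1028` filter used above: 1028 is simply the sum of the SAM flag bits for unmapped reads (4, UNMAP) and marked duplicates (1024, DUP), which shell arithmetic confirms.

```shell
# SAM flag bits: 4 = read unmapped (UNMAP), 1024 = PCR/optical duplicate (DUP)
echo $((4 + 1024))      # prints 1028, the value passed to samtools view -F
echo $((1028 & 4))      # prints 4: the UNMAP bit is set in 1028
echo $((1028 & 1024))   # prints 1024: the DUP bit is set in 1028
```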
paste ~/workspace/module123/qc/col_H3K27ac ~/workspace/module123/qc/col_H3K27me3 ~/workspace/module123/qc/col_H3K4me3 ~/workspace/module123/qc/col_ATAC ~/workspace/module123/qc/col_input > ~/workspace/module123/qc/integers.tsv
- `paste` lets us aggregate our results, where each file becomes a column
- Read enrichment in key regions:
| | H3K27ac | H3K27me3 | H3K4me3 | ATAC | Input |
|---|---|---|---|---|---|
| Total | 442829 | 1610910 | 1512352 | 2751833 | 1023265 |
| TSS | 127577 | 298326 | 1087032 | 924326 | 194996 |
| Enhancer | 242489 | 760089 | 834053 | 1721794 | 514115 |
| In Peaks | 69584 | 783294 | 1239717 | 1082674 | |
| | H3K27ac | H3K27me3 | H3K4me3 | ATAC | Input |
|---|---|---|---|---|---|
| TSS | 28.8095 | 18.5191 | 71.8769 | 33.5895 | 19.0563 |
| Enhancer | 54.7591 | 47.1838 | 55.1494 | 62.569 | 50.2426 |
| In Peaks | 15.7135 | 48.6243 | 81.9728 | 39.3437 | |
H3K27ac FRIP is poor as a result of the poorer sequencing depth. H3K4me3 is performing well: high FRIP and enrichment in promoter regions.

####TSS+/-2kb
mkdir -p ~/workspace/module123/qc
wget https://www.bcgsc.ca/downloads/esu/touchdown/hg38v79_genes_tss_2000.bed -O ~/workspace/module123/resources/hg38v79_genes_tss_2000.bed
sort -k1,1 -k2,2n ~/workspace/module123/resources/hg38v79_genes_tss_2000.bed > tmp
mv tmp ~/workspace/module123/resources/hg38v79_genes_tss_2000.bed
####Enhancer liftover
wget https://www.bcgsc.ca/downloads/esu/touchdown/encode_enhancers_liftover.bed -O ~/workspace/module123/resources/encode_enhancers_liftover.bed
####Blacklist
wget https://www.encodeproject.org/files/ENCFF356LFX/@@download/ENCFF356LFX.bed.gz -O ~/workspace/module123/resources/hg38_blacklist.bed.gz
gunzip ~/workspace/module123/resources/hg38_blacklist.bed.gz
hg38v79_genes_tss_2000.bed
encode_enhancers_liftover.bed
ls ~/CourseData/EPI_data/module123/encode_bed
basal.H3K27ac.peak_calls.bed
basal.H3K27me3.peak_calls.bed
basal.H3K4me1.peak_calls.bed
basal.H3K4me3.peak_calls.bed
lp.H3K27ac.peak_calls.bed
lp.H3K27me3.peak_calls.bed
lp.H3K4me1.peak_calls.bed
lp.H3K4me3.peak_calls.bed
luminal.H3K27ac.peak_calls.bed
luminal.H3K27me3.peak_calls.bed
luminal.H3K4me1.peak_calls.bed
luminal.H3K4me3.peak_calls.bed
stromal.H3K27ac.peak_calls.bed
stromal.H3K27me3.peak_calls.bed
stromal.H3K4me1.peak_calls.bed
stromal.H3K4me3.peak_calls.bed
ls ~/CourseData/EPI_data/module123/encode_bigWig
basal.H3K27ac.signal_unstranded.bigWig
basal.H3K27me3.signal_unstranded.bigWig
basal.H3K4me1.signal_unstranded.bigWig
basal.H3K4me3.signal_unstranded.bigWig
lp.H3K27ac.signal_unstranded.bigWig
lp.H3K27me3.signal_unstranded.bigWig
lp.H3K4me1.signal_unstranded.bigWig
lp.H3K4me3.signal_unstranded.bigWig
luminal.H3K27ac.signal_unstranded.bigWig
luminal.H3K27me3.signal_unstranded.bigWig
luminal.H3K4me1.signal_unstranded.bigWig
luminal.H3K4me3.signal_unstranded.bigWig
stromal.H3K27ac.signal_unstranded.bigWig
stromal.H3K27me3.signal_unstranded.bigWig
stromal.H3K4me1.signal_unstranded.bigWig
stromal.H3K4me3.signal_unstranded.bigWig
ls ~/CourseData/EPI_data/module123/fastq
MCF10A.ATAC.chr19.R1.fastq.gz
MCF10A.ATAC.chr19.R2.fastq.gz
MCF10A.H3K27ac.chr19.R1.fastq.gz
MCF10A.H3K27ac.chr19.R2.fastq.gz
MCF10A.H3K27me3.chr19.R1.fastq.gz
MCF10A.H3K27me3.chr19.R2.fastq.gz
MCF10A.H3K4me3.chr19.R1.fastq.gz
MCF10A.H3K4me3.chr19.R2.fastq.gz
MCF10A.Input.chr19.R1.fastq.gz
MCF10A.Input.chr19.R2.fastq.gz
CourseData/EPI_data/module123/triplicates/triplicates.csv
CourseData/EPI_data/module123/triplicates/alignments:
MCF10A_H3K4me3_chr19.CondA.Rep1.bam MCF10A_H3K4me3_chr19.CondB.Rep2.bam MCF10A_input_chr19.CondA.Rep3.bam
MCF10A_H3K4me3_chr19.CondA.Rep1.bam.bai MCF10A_H3K4me3_chr19.CondB.Rep2.bam.bai MCF10A_input_chr19.CondA.Rep3.bam.bai
MCF10A_H3K4me3_chr19.CondA.Rep2.bam MCF10A_H3K4me3_chr19.CondB.Rep3.bam MCF10A_input_chr19.CondB.Rep1.bam
MCF10A_H3K4me3_chr19.CondA.Rep2.bam.bai MCF10A_H3K4me3_chr19.CondB.Rep3.bam.bai MCF10A_input_chr19.CondB.Rep1.bam.bai
MCF10A_H3K4me3_chr19.CondA.Rep3.bam MCF10A_input_chr19.CondA.Rep1.bam MCF10A_input_chr19.CondB.Rep2.bam
MCF10A_H3K4me3_chr19.CondA.Rep3.bam.bai MCF10A_input_chr19.CondA.Rep1.bam.bai MCF10A_input_chr19.CondB.Rep2.bam.bai
MCF10A_H3K4me3_chr19.CondB.Rep1.bam MCF10A_input_chr19.CondA.Rep2.bam MCF10A_input_chr19.CondB.Rep3.bam
MCF10A_H3K4me3_chr19.CondB.Rep1.bam.bai MCF10A_input_chr19.CondA.Rep2.bam.bai MCF10A_input_chr19.CondB.Rep3.bam.bai
CourseData/EPI_data/module123/triplicates/bigWig:
CondA.Rep1.bw CondA.Rep2.bw CondA.Rep3.bw CondB.Rep1.bw CondB.Rep2.bw CondB.Rep3.bw
CourseData/EPI_data/module123/triplicates/peaks:
CondA.Rep1_peaks.narrowPeak CondA.Rep3_peaks.narrowPeak CondB.Rep2_peaks.narrowPeak
CondA.Rep2_peaks.narrowPeak CondB.Rep1_peaks.narrowPeak CondB.Rep3_peaks.narrowPeak
Lab Completed!
Congratulations! You have completed Lab 2!
Code:
Like this! This is the main code to run for the step.
Additionally, each code block will include a header indicating which environment to run the code in, for example:
###Shell###
pwd
###R###
getwd()
Both pwd (Shell) and getwd() (R) print the current working directory.
Note
Important points and considerations will also be raised as so.
Code
###Shell###
cp ~/CourseData/EPI_data/module123/encode_bigWig/*H3K4me3* ~/workspace/module123/bigWig/
cp ~/CourseData/EPI_data/module123/encode_bigBed/*H3K4me3* ~/workspace/module123/bigBed/
cp ~/CourseData/EPI_data/module123/triplicates/bigWig/* ~/workspace/module123/bigWig/
Code
###Shell###
MCF10A_H3K27ac=~/workspace/module123/peaks/MCF10A_H3K27ac_peaks.blacklistRemoved.narrowPeak
MCF10A_H3K27me3=~/workspace/module123/peaks/MCF10A_H3K27me3_peaks.blacklistRemoved.broadPeak
MCF10A_H3K4me3=~/workspace/module123/peaks/MCF10A_H3K4me3_peaks.blacklistRemoved.narrowPeak
bedtools intersect -u -a ${MCF10A_H3K27ac} -b ${MCF10A_H3K27me3} | wc -l
bedtools intersect -u -a ${MCF10A_H3K27ac} -b ${MCF10A_H3K4me3} | wc -l
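The commands above require bedtools. As a dependency-free sketch of what `-u` changes, the following toy script (file names and coordinates invented for this example) reports an A element once per overlapping B element by default, but only once overall in `-u` style.

```shell
tmp=$(mktemp -d)
# one A element that overlaps two B elements
printf 'chr1\t100\t200\tA1\n' > "$tmp/a.bed"
printf 'chr1\t120\t140\nchr1\t150\t170\n' > "$tmp/b.bed"
# default-like behaviour: one report per (A,B) overlap, so A1 appears twice
awk 'NR==FNR {bs[NR]=$2; be[NR]=$3; next} {for (i in bs) if ($2 < be[i] && bs[i] < $3) print $4}' "$tmp/b.bed" "$tmp/a.bed"
# -u-like behaviour: report A1 once, no matter how many B elements it overlaps
awk 'NR==FNR {bs[NR]=$2; be[NR]=$3; next} {for (i in bs) if ($2 < be[i] && bs[i] < $3) {print $4; next}}' "$tmp/b.bed" "$tmp/a.bed"
rm -r "$tmp"
```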
bedtools intersect -u -a ${MCF10A_H3K27ac} -b ${MCF10A_H3K27me3} | wc -l
6/988

bedtools intersect -u -a ${MCF10A_H3K27ac} -b ${MCF10A_H3K4me3} | wc -l
789/988

Code
###Shell###
basal_H3K27ac=~/CourseData/EPI_data/module123/encode_bed/basal.H3K27ac.peak_calls.bed
luminal_H3K27ac=~/CourseData/EPI_data/module123/encode_bed/luminal.H3K27ac.peak_calls.bed
stromal_H3K27ac=~/CourseData/EPI_data/module123/encode_bed/stromal.H3K27ac.peak_calls.bed
lp_H3K27ac=~/CourseData/EPI_data/module123/encode_bed/lp.H3K27ac.peak_calls.bed
MCF10A_H3K27ac=~/workspace/module123/peaks/MCF10A_H3K27ac_peaks.blacklistRemoved.narrowPeak
paste \
<(ls ~/CourseData/EPI_data/module123/encode_bed/*H3K4me3* | xargs -I {} sh -c "bedtools intersect -u -a ~/workspace/module123/peaks/MCF10A_H3K4me3_peaks.blacklistRemoved.narrowPeak -b {} | wc -l") \
<(ls ~/CourseData/EPI_data/module123/encode_bed/*H3K27ac* | xargs -I {} sh -c "bedtools intersect -u -a ~/workspace/module123/peaks/MCF10A_H3K27ac_peaks.blacklistRemoved.narrowPeak -b {} | wc -l") \
<(ls ~/CourseData/EPI_data/module123/encode_bed/*H3K27me3* | xargs -I {} sh -c "bedtools intersect -u -a ~/workspace/module123/peaks/MCF10A_H3K27me3_peaks.blacklistRemoved.broadPeak -b {} | wc -l")
- `paste` lets us aggregate our results, where each column is the output of the command within `<(COMMAND)`
- each `<(COMMAND)` contains the following: list the peak files | intersect and count each one
- `\` lets us continue our command on another line

| | H3K4me3 | H3K27ac | H3K27me3 |
|---|---|---|---|
| MCF10A | 1406 | 988 | 4797 |
| Intersecting Basal | 1039 | 656 | 2420 |
| Intersecting Luminal Progenitor | 1099 | 778 | 2496 |
| Intersecting Luminal | 1024 | 766 | 2430 |
| Intersecting Stromal | 978 | 717 | 2604 |
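The `paste <(COMMAND) <(COMMAND)` idiom above is plain bash process substitution, and can be tried on toy input (the two `printf` commands here stand in for the intersection counts):

```shell
# each <(...) behaves like a file holding its command's output;
# paste lines them up side by side, one command per column
paste <(printf '3\n1\n') <(printf '7\n2\n')
# 3       7
# 1       2
```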
Code:
###Shell###
basal_H3K4me3=~/CourseData/EPI_data/module123/encode_bed/basal.H3K4me3.peak_calls.bed
luminal_H3K4me3=~/CourseData/EPI_data/module123/encode_bed/luminal.H3K4me3.peak_calls.bed
stromal_H3K4me3=~/CourseData/EPI_data/module123/encode_bed/stromal.H3K4me3.peak_calls.bed
lp_H3K4me3=~/CourseData/EPI_data/module123/encode_bed/lp.H3K4me3.peak_calls.bed
MCF10A_H3K4me3=~/workspace/module123/peaks/MCF10A_H3K4me3_peaks.blacklistRemoved.narrowPeak
bedtools intersect -u -a ${MCF10A_H3K4me3} -b ${lp_H3K4me3} | bedtools intersect -v -a stdin -b ${basal_H3K4me3} ${luminal_H3K4me3} ${stromal_H3K4me3} | wc -l
bedtools intersect -u -a ${MCF10A_H3K4me3} -b ${lp_H3K4me3} |\
bedtools intersect -u -a stdin -b ${basal_H3K4me3} |\
bedtools intersect -u -a stdin -b ${luminal_H3K4me3} |\
bedtools intersect -u -a stdin -b ${stromal_H3K4me3} | wc -l
bedtools intersect -u -a ${MCF10A_H3K4me3} -b ${lp_H3K4me3} | bedtools intersect -v -a stdin -b ${basal_H3K4me3} ${luminal_H3K4me3} ${stromal_H3K4me3}| wc -l
- Common with LP | not found in Basal OR Luminal OR Stromal
- `stdin` takes the result from the first half of our command and uses it as input in the next; not every tool accepts `stdin`, so check the documentation first

bedtools intersect -u -a ${MCF10A_H3K4me3} -b ${lp_H3K4me3} | bedtools intersect -u -a stdin -b ${basal_H3K4me3} | bedtools intersect -u -a stdin -b ${luminal_H3K4me3} | bedtools intersect -u -a stdin -b ${stromal_H3K4me3} | wc -l

- MCF10A common with LP | common with Basal | common with Luminal | common with Stromal

Code:
###Shell###
condA_rep1=~/CourseData/EPI_data/module123/triplicates/peaks/CondA.Rep1_peaks.narrowPeak
condB_rep1=~/CourseData/EPI_data/module123/triplicates/peaks/CondB.Rep1_peaks.narrowPeak
condA_rep2=~/CourseData/EPI_data/module123/triplicates/peaks/CondA.Rep2_peaks.narrowPeak
condB_rep2=~/CourseData/EPI_data/module123/triplicates/peaks/CondB.Rep2_peaks.narrowPeak
condA_rep3=~/CourseData/EPI_data/module123/triplicates/peaks/CondA.Rep3_peaks.narrowPeak
condB_rep3=~/CourseData/EPI_data/module123/triplicates/peaks/CondB.Rep3_peaks.narrowPeak
bedtools intersect -u -a ${condA_rep1} -b ${condA_rep2} | wc -l
bedtools intersect -u -a ${condA_rep1} -b ${condB_rep2} | wc -l
bedtools intersect -v -a ${condA_rep1} -b ${condA_rep2} | wc -l
bedtools intersect -v -a ${condA_rep1} -b ${condB_rep2} | wc -l
bedtools intersect -u -a ${condA_rep1} -b ${condA_rep2} ${condA_rep3} | wc -l
bedtools intersect -u -a ${condA_rep1} -b ${condA_rep2} ${condA_rep3} -f 0.5 -F 0.5 | wc -l
bedtools intersect -wao -a ${condA_rep1} -b ${condA_rep2} ${condA_rep3} | head
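The `-f`/`-F` fractions can be reasoned about by hand. For one hypothetical pair of intervals, the overlap length is `min(ends) - max(starts)`, and `-f` compares that length to the size of the A element:

```shell
# A = [100,200), B = [150,300): overlap = min(200,300) - max(100,150) = 50
awk 'BEGIN {
  a1=100; a2=200; b1=150; b2=300
  ov = (a2 < b2 ? a2 : b2) - (a1 > b1 ? a1 : b1)
  print ov / (a2 - a1)   # 0.5, so this pair would just satisfy -f 0.5
}'
```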
bedtools intersect -u -a ${condA_rep1} -b ${condA_rep2} | wc -l
1191

bedtools intersect -u -a ${condA_rep1} -b ${condB_rep2} | wc -l
1093

bedtools intersect -v -a ${condA_rep1} -b ${condA_rep2} | wc -l
50

bedtools intersect -v -a ${condA_rep1} -b ${condB_rep2} | wc -l
148

bedtools intersect -u -a ${condA_rep1} -b ${condA_rep2} ${condA_rep3} | wc -l

bedtools intersect -wao -a ${condA_rep1} -b ${condA_rep2} ${condA_rep3} | head

- `-wao` returns the original line of ${condA_rep1} together with the element it intersects

bedtools intersect -u -a ${condA_rep1} -b ${condA_rep2} ${condA_rep3} -f 0.5 -F 0.5 | wc -l

- `-f 0.5` adds the condition that intersections are only counted when at least 50% of the A element overlaps
- `-F 0.5` adds the condition that intersections are only counted when at least 50% of the B element overlaps

bedtools intersect -wao -a ${condA_rep1} -b ${condA_rep2} ${condA_rep3} | head

- `-wao` returns the original line of ${condA_rep1} together with the element it intersects

Code:
###Shell###
MCF10A_H3K27ac=~/workspace/module123/peaks/MCF10A_H3K27ac_peaks.blacklistRemoved.narrowPeak
TSS=~/workspace/module123/resources/hg38v79_genes_tss_2000.bed
bedtools closest -a ${MCF10A_H3K27ac} -b ${TSS} -d | head
###
condA_peaks=~/CourseData/EPI_data/module123/triplicates/peaks/CondA.Rep1_peaks.narrowPeak
condB_peaks=~/CourseData/EPI_data/module123/triplicates/peaks/CondB.Rep1_peaks.narrowPeak
cat ${condA_peaks} ${condB_peaks} | sort -k1,1 -k2,2n | bedtools merge -i stdin > ~/workspace/module123/deeptools/merged_peaks.bed
####
methylation=~/workspace/module123/resources/example_methylation.bed
echo chr19 1 58617616 |\
sed 's/ /\t/g' |\
bedtools makewindows -w 50 -b stdin |\
awk 'BEGIN{srand(1)}{print $0"\t"rand()}' \
> ${methylation}
bedtools map -a ${MCF10A_H3K27ac} -b ${methylation} -c 4 -o median,count | head
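The simulated-methylation one-liner above requires bedtools for the windowing; the idea alone (fixed-width bins plus a seeded random value) can be sketched with plain awk over a made-up 200 bp stretch of chr19:

```shell
# emit four 50 bp windows over chr19:0-200, each with a seeded pseudo-random value
awk 'BEGIN {
  srand(1)                                  # fixed seed, so the values are reproducible
  for (s = 0; s < 200; s += 50)
    print "chr19\t" s "\t" s + 50 "\t" rand()
}'
```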
- `sed 's/ /\t/g'` replaces each space with a tab (`\t`)

bedtools closest -a ${MCF10A_H3K27ac} -b ${TSS} -d | head

- `-d` makes the tool also report the distance to the closest element

cat ${condA_peaks} ${condB_peaks} | sort -k1,1 -k2,2n | bedtools merge -i stdin

- read peak files | sort the peaks coordinate-wise | merge the peaks
- `bedtools merge` takes the elements specified in the input and merges features that intersect
- the command starts off with `cat ${condA_peaks} ${condB_peaks}`, which concatenates the two peak files

echo chr19 1 58617616 | sed 's/ /\t/g' | bedtools makewindows -w 50 -b stdin | awk 'BEGIN{srand(1)}{print $0"\t"rand()}' > ${methylation}

- simulate a BED file covering chr19 | split the BED file into 50bp bins | add a random float value
- `bedtools makewindows -w 50 -b stdin`
- `-w 50` specifies our window size; `-n` would instead specify how many windows to divide the BED file into
- `awk 'BEGIN{srand(1)}{print $0"\t"rand()}'`
- `srand(1)` sets our seed; changing it changes the pseudo-random numbers
- `print $0"\t"rand()`: as awk processes line by line, print the current line (the windowed genomic coordinates) plus a random float

bedtools map -a ${MCF10A_H3K27ac} -b ${methylation} -c 4 -o median,count | head

- `bedtools map` applies a summarizing function to the values of file B that intersect file A
- `-c 4` indicates which column from file B to use
- `-o median,count` returns the median of column 4 and the count of file B elements that intersected each file A element

Differential binding analysis can then be performed with the `diffBind` package in R.

Code :
###R###
library(DiffBind)
setwd("/home/ubuntu")
samples <- read.csv("CourseData/EPI_data/module123/triplicates/triplicates.csv")
MCF10A <- dba(sampleSheet=samples)
MCF10A <- dba.count(MCF10A, bUseSummarizeOverlaps=TRUE)
dba.plotPCA(MCF10A, attributes=DBA_CONDITION,label=DBA_ID)
plot(MCF10A)
MCF10A <- dba.contrast(MCF10A, categories=DBA_CONDITION)
MCF10A <- dba.analyze(MCF10A, method=DBA_EDGER)
analyzed_peaks <- dba.report(MCF10A, method=DBA_EDGER, fold=1)
dba.plotMA(MCF10A, bXY=TRUE , method=DBA_EDGER, fold=1)
write.table(analyzed_peaks, file="workspace/module123/diffBind/differential_peaks.tsv", sep="\t", quote=F, row.names=F, col.names=F)
library(DiffBind)
setwd("/home/ubuntu")
read.csv("CourseData/EPI_data/module123/triplicates/triplicates.csv")
MCF10A <- dba(sampleSheet=samples)
- creates a `dba` object that will be saved as `MCF10A`

MCF10A <- dba.count(MCF10A, bUseSummarizeOverlaps=TRUE)

- `bUseSummarizeOverlaps=TRUE` indicates the counting module to be used; `summarizeOverlaps` comes from `GenomicAlignments`

dba.plotPCA(MCF10A, attributes=DBA_CONDITION,label=DBA_ID)

- plots a PCA of `MCF10A`, where the annotations are `DBA_CONDITION` and the labelling is `DBA_ID`

plot(MCF10A)

MCF10A <- dba.contrast(MCF10A, categories=DBA_CONDITION)

- `categories=DBA_CONDITION` the category we want to compare

MCF10A <- dba.analyze(MCF10A, method=DBA_EDGER)

- runs the analysis based on the contrast we previously established
- `method=DBA_EDGER`: the analysis engine is a library called edgeR
- DESeq2 does not work here: DESeq2 has a built-in check for variability, which our synthetic dataset is lacking

analyzed_peaks <- dba.report(MCF10A, method=DBA_EDGER, fold=1)

- reports the peaks found by `DBA_EDGER` to be significant with an absolute fold change > 1

dba.plotMA(MCF10A, bXY=TRUE , method=DBA_EDGER, fold=1)

- `method=DBA_EDGER` fetch results based on our previous analysis using edgeR
- `bXY=TRUE` produces a scatter plot; `FALSE` produces an MA plot
- `fold=1` report differential positions that meet the fold change threshold

write.table(analyzed_peaks, file="workspace/module123/diffBind/differential_peaks.tsv", sep="\t", quote=F, row.names=F, col.names=F)

- `sep="\t"` the separator to be used
- `col.names=F` exclude column names
- `row.names=F` exclude row names
- `quote=F` whether we want quotations around values

Code:
###Shell###
condA_peaks=~/CourseData/EPI_data/module123/triplicates/peaks/CondA.Rep1_peaks.narrowPeak
condB_peaks=~/CourseData/EPI_data/module123/triplicates/peaks/CondB.Rep1_peaks.narrowPeak
mkdir ~/workspace/module123/edgeR/
cat ${condA_peaks} ${condB_peaks} | sort -k1,1 -k2,2n | bedtools merge -i stdin > ~/workspace/module123/edgeR/merged_peaks.bed
cat ${condA_peaks} ${condB_peaks} | sort -k1,1 -k2,2n | bedtools merge -i stdin > merged_peaks.bed
- Read our peaks | sort the peaks coordinate-wise | merge the peaks

Code:
###Shell###
peaks=~/workspace/module123/edgeR/merged_peaks.bed
condA_bam=~/workspace/module123/alignments/MCF10A_H3K4me3_chr19.CondA.Rep1.bam
condB_bam=~/workspace/module123/alignments/MCF10A_H3K4me3_chr19.CondB.Rep1.bam
condA_count=~/workspace/module123/edgeR/MCF10A_H3K4me3_chr19.CondA.Rep1.bed
condB_count=~/workspace/module123/edgeR/MCF10A_H3K4me3_chr19.CondB.Rep1.bed
bedtools intersect -a ${peaks} -b ${condA_bam} -c > ${condA_count}
bedtools intersect -a ${peaks} -b ${condB_bam} -c > ${condB_count}
bedtools intersect -a ${peaks} -b ${condA_bam} -c > ${condA_count}
- `bedtools intersect -a ${peaks} -b ${condA_bam}` intersects our merged peaks with the BAM file
- `-c` reports the count of BAM reads that overlap each peak

We can now load these counts into R and perform a statistical analysis.

Code:
###R###
setwd("/home/ubuntu")
library(edgeR)
library(dplyr)
condA<-read.csv("workspace/module123/edgeR/MCF10A_H3K4me3_chr19.CondA.Rep1.bed",sep='\t',col.names = c("chr", "start", "end","MCF10A_H3K4me3_chr19.CondA.Rep1"),colClasses= c("character","character","character","numeric"))
condB<-read.csv("workspace/module123/edgeR/MCF10A_H3K4me3_chr19.CondB.Rep1.bed",sep='\t',col.names = c("chr", "start", "end","MCF10A_H3K4me3_chr19.CondB.Rep1"),colClasses= c("character","character","character","numeric"))
peaks<-data.frame(MCF10A_H3K4me3_chr19.CondA.Rep1=condA$MCF10A_H3K4me3_chr19.CondA.Rep1,MCF10A_H3K4me3_chr19.CondB.Rep1=condB$MCF10A_H3K4me3_chr19.CondB.Rep1)
row.names(peaks)<-paste(condA$chr,condA$start,condA$end,sep='_')
edger_dl <- DGEList(counts=peaks, group=1:2,lib.size=c(1131503,1266436))
edger_tmm <- calcNormFactors(edger_dl, method = "TMM")
bcv=0.1
edger_et <- exactTest(edger_tmm,dispersion=bcv^2)
edger_tp <- topTags(edger_et, n=nrow(edger_et$table),adjust.method="BH")
de <- edger_tp$table %>% filter(FDR < 0.01) %>% filter(logFC >=1 | logFC <=-1)
write.table(de, file="workspace/module123/edgeR/differential_peaks_edger.tsv", sep="\t", quote=F, row.names=F, col.names=F)
- `condA<-read.csv("workspace/module123/edgeR/MCF10A_H3K4me3_chr19.CondA.Rep1.bed",sep='\t',col.names = c("chr", "start", "end","MCF10A_H3K4me3_chr19.CondA.Rep1"),colClasses= c("character","character","character","numeric"))`
- `read.csv` read the `CSV` file into a `data.frame`
- `sep='\t'` specify the delimiter
- `col.names = c("chr", "start", "end","MCF10A_H3K4me3_chr19.CondA.Rep1")` set our column names
    - `colClasses= c("character","character","character","numeric")` indicates which columns are which datatypes
- `peaks<-data.frame(MCF10A_H3K4me3_chr19.CondA.Rep1=condA$MCF10A_H3K4me3_chr19.CondA.Rep1,MCF10A_H3K4me3_chr19.CondB.Rep1=condB$MCF10A_H3K4me3_chr19.CondB.Rep1)`
- make a new data.frame using our two columns from condA and condB
- `row.names(peaks)<-paste(condA$chr,condA$start,condA$end,sep='_')`
    - edit the row names to match the genomic coordinates
    - `row.names(peaks)<-` indicates we want to overwrite the existing row names
    - `paste(condA$chr,condA$start,condA$end,sep='_')` concatenates `chr`, `start` and `end` with `_` as the separator
- `edger_dl <- DGEList(counts=peaks, group=1:2,lib.size=c(1131503,1266436))`
    - `DGEList(counts=peaks, group=1:2)` reads in our `peaks` data.frame
- `group=1:2` identify which groups we want to contrast
- `lib.size=c(1131503,1266436)` library size for the two conditions
- `edger_tmm <- calcNormFactors(edger_dl, method = "TMM")` calculate the normalization factor to scale library sizes
- `bcv=0.1` sets the square root of the dispersion (the biological coefficient of variation). According to the edgeR documentation, typical values from well-controlled experiments are `0.4 for human data, 0.1 for data on genetically identical model organisms or 0.01 for technical replicates`
- `edger_et <- exactTest(edger_tmm,dispersion=bcv^2)`
- calculate `FC` and `Pvalue` per row
- `edger_tp <- topTags(edger_et, n=nrow(edger_et$table),adjust.method="BH")`
    - calculate `FDR` via Benjamini-Hochberg
- `de <- edger_tp$table %>% filter(FDR < 0.01) %>% filter(logFC >=1 | logFC <=-1)`
- `filter(FDR < 0.01)` filter for significant peaks
- `filter(logFC >=1 | logFC <=-1)` filter for peaks with appropriate fold change
- `write.table(de, file="workspace/module123/edgeR/differential_peaks_edger.tsv", sep="\t", quote=F, row.names=F, col.names=F)`
- save files
<img src="https://github.com/bioinformaticsdotca/EPI_2023/blob/module123/module123_images/edger.png?raw=true" alt="Region" width="750" />
Lab Completed!
Congratulations! You have completed Lab 3!
This module will cover the basics of Whole Genome Bisulfite-Sequencing (WGBS) data analysis including data visualization in IGV.
Tools used in this module: Bismark, the IGV genome browser, and MethylKit.

mkdir -p ~/workspace/module4
cd ~/workspace/module4
GENOME=~/CourseData/EPI_data/module4/Homo_sapiens.GRCh38.chr19
export GENOME
This will define a variable $GENOME that will simplify future commands.
WGBS_DATA=~/CourseData/EPI_data/module4/data
export WGBS_DATA
This will define a variable $WGBS_DATA that will simplify future commands.
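Exporting a variable this way makes it available to the current shell and to any programs it launches; here is a minimal sketch with a hypothetical path (the name and directory are invented for illustration):

```shell
# define and export a variable, then reuse it to shorten later commands
DEMO_DATA=/tmp/demo_wgbs_data   # hypothetical path, for illustration only
export DEMO_DATA
echo "$DEMO_DATA/WGBS.sample.chr19.1.fastq.gz"
```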
Type the following command: ls $WGBS_DATA. What do you see?
You should see something similar to this
WGBS.A34002.137160.chr19.1.fastq.gz
WGBS.A34002.137160.chr19.2.fastq.gz
WGBS.A34002.137487.chr19.1.fastq.gz
WGBS.A34002.137487.chr19.2.fastq.gz
WGBS.A34002.137488.chr19.1.fastq.gz
WGBS.A34002.137488.chr19.2.fastq.gz
These are the files that will be used for the workshop. They contain a subset of WGBS reads from CEMT sample CEMT0007, which is a mammary gland epithelial cell line (more information here).
What do the “.1” and “.2” in the file names mean?
They represent read 1 and read 2 of the paired-end reads.
We will now process and map the reads using the Bismark WGBS aligner (more info here).
To simplify the work, we will process the datasets one at a time. To align the first dataset, do the following:
cd ~/workspace/module4
bismark --multicore 4 --bowtie2 $GENOME/genome/bismark_index \
-1 $WGBS_DATA/WGBS.A34002.137160.chr19.1.fastq.gz -2 $WGBS_DATA/WGBS.A34002.137160.chr19.2.fastq.gz
What do all the options in the command mean? (Hint: check the help by using bismark --help.)
The --multicore 4 option enables multithreaded processing to improve speed.
The --bowtie2 option selects Bowtie 2 as the mapping algorithm.
The $GENOME/genome/bismark_index argument specifies the location of the index for the reference genome to use. This uses the $GENOME variable we defined previously.
The -1 $WGBS_DATA/WGBS.A34002.137160.chr19.1.fastq.gz argument specifies the location of read 1; likewise, -2 specifies read 2. This uses the $WGBS_DATA variable we defined previously.
For more details, please refer to the Bismark user guide.
This step will take a few minutes to run for this reduced dataset. A dataset spanning a full genome will take several hours.
For your own datasets, make sure you have enough computing walltime to run the alignment.
While you wait for the results, ask any questions you have up to this point to the instructors.
At the end of the alignment, you should have the following files saved into your workshop folder:
```{:output}
WGBS.A34002.137160.chr19.1_bismark_bt2_pe.bam
WGBS.A34002.137160.chr19.1_bismark_bt2_PE_report.txt
```
Bismark offers a tool to generate an interactive HTML report for each sample, which we can now run to obtain the summary for our first sample.
bismark2report
This will produce an additional file called: WGBS.A34002.137160.chr19.1_bismark_bt2_PE_report.html.
Let’s look at the report. We can read the text file directly on the command line:
less WGBS.A34002.137160.chr19.1_bismark_bt2_PE_report.txt
Or open the HTML report using your internet browser and the IP address of your AWS instance. Just click on the link to the HTML report and it should open.
What was the mapping efficiency? What percent of C’s were methylated in CpG context?
According to the report:
...
Mapping efficiency: 92.4%
...
C methylated in CpG context: 57.4%
C methylated in CHG context: 0.6%
C methylated in CHH context: 0.5%
C methylated in unknown context (CN or CHN): 3.5%
...
You can also look at the Cytosine Methylation plot section in the interactive HTML report.
Close the report by pressing q.
We need to sort the bam file and prepare an index so we will be able to load it in IGV. We will use the program samtools for this.
samtools sort WGBS.A34002.137160.chr19.1_bismark_bt2_pe.bam -o WGBS.A34002.137160.chr19.1_bismark_bt2_pe_sorted.bam
samtools index WGBS.A34002.137160.chr19.1_bismark_bt2_pe_sorted.bam
How would you repeat the alignment with the other datasets?
This is the command to run bismark on the two other samples:
cd ~/workspace/module4
bismark --multicore 4 --bowtie2 $GENOME/genome/bismark_index \
-1 $WGBS_DATA/WGBS.A34002.137487.chr19.1.fastq.gz -2 $WGBS_DATA/WGBS.A34002.137487.chr19.2.fastq.gz
bismark --multicore 4 --bowtie2 $GENOME/genome/bismark_index \
-1 $WGBS_DATA/WGBS.A34002.137488.chr19.1.fastq.gz -2 $WGBS_DATA/WGBS.A34002.137488.chr19.2.fastq.gz
Remember, for the command to work, both $GENOME and
$WGBS_DATA need to be defined.
Also, if you want to generate the HTML reports, you can run
bismark2report. Bismark also has a “summary” that produces
a report for all samples (bismark2summary), we can run both
by doing the following:
bismark2report ; bismark2summary
After checking the reports, we can run the commands to prepare the samples for IGV (sort and index):
samtools sort WGBS.A34002.137487.chr19.1_bismark_bt2_pe.bam -o WGBS.A34002.137487.chr19.1_bismark_bt2_pe_sorted.bam
samtools index WGBS.A34002.137487.chr19.1_bismark_bt2_pe_sorted.bam
samtools sort WGBS.A34002.137488.chr19.1_bismark_bt2_pe.bam -o WGBS.A34002.137488.chr19.1_bismark_bt2_pe_sorted.bam
samtools index WGBS.A34002.137488.chr19.1_bismark_bt2_pe_sorted.bam
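Since the three samples follow the same naming pattern, the sort and index steps can be scripted. A minimal bash sketch as a dry run: it only prints the command pairs, so you can check them first; remove the echo prefixes to actually execute them (requires samtools on your PATH).

```shell
# Dry run: print the sort + index command pair for every sample ID.
# Drop the "echo" prefixes to actually run samtools.
for ID in 137160 137487 137488; do
    BAM="WGBS.A34002.${ID}.chr19.1_bismark_bt2_pe"
    echo samtools sort "${BAM}.bam" -o "${BAM}_sorted.bam"
    echo samtools index "${BAM}_sorted.bam"
done
```

The same pattern works for any per-sample step in this lab that only differs by the sample ID.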
While you wait for the previous steps to finish executing, it is a good idea to begin exploring the alignments.
Retrieve the files called WGBS.A34002.137160.chr19.1_bismark_bt2_pe_sorted.bam and WGBS.A34002.137160.chr19.1_bismark_bt2_pe_sorted.bam.bai from the server using your internet browser and the IP address of your AWS instance.
Optionally, if you are not using Putty (i.e. if you are using a Linux or Mac computer) you can also retrieve the files directly using the terminal and scp using the commands below.
Optional download option (click here)
scp -i CBW.pem ubuntu@00.00.00.0:~/workspace/module4/WGBS.A34002.137160.chr19.1_bismark_bt2_pe_sorted.bam .
scp -i CBW.pem ubuntu@00.00.00.0:~/workspace/module4/WGBS.A34002.137160.chr19.1_bismark_bt2_pe_sorted.bam.bai .
Remember that for the commands above to work, you need to be in the directory with the CBW.pem (or CBW.cer) file you downloaded from AWS when creating your instance.
If you haven’t installed it yet, please get it from the IGV download page.
Make sure that the human genome is selected in the top left corner. It should read: Human (hg38).
Load your sorted bam file in IGV using File -> Load from file. For this to work, you need to have the index file (.bai) in the same location as the bam file.
Now go to the following location:
chr19:43,375,889-45,912,052
And zoom in until you see something.
For instance, try the following window:
chr19:44,527,387-44,536,873
You should see something like this:
If it looks different, can you change the way the colors are displayed?
Which section of which chromosome is covered by this dataset?
Can you see any interesting patterns in the coverage?
So far we have only mapped the reads using Bismark. We can generate methylation profiles using the following command:
cd ~/workspace/module4
bismark_methylation_extractor --cytosine_report WGBS.A34002.137160.chr19.1_bismark_bt2_pe.bam --genome_folder $GENOME/genome/bismark_index
The --cytosine_report option creates a CpG_report table summarizing key data for each CG position in the genome, which will be useful later (see section 4.2).
How would you do the same for the other replicates?
These are the commands that you should use:
cd ~/workspace/module4
bismark_methylation_extractor --cytosine_report WGBS.A34002.137487.chr19.1_bismark_bt2_pe.bam --genome_folder $GENOME/genome/bismark_index
bismark_methylation_extractor --cytosine_report WGBS.A34002.137488.chr19.1_bismark_bt2_pe.bam --genome_folder $GENOME/genome/bismark_index
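The extractor calls above differ only by sample ID, so they can also be generated with a loop. A sketch that writes the three commands into a helper script (the run_extractor.sh name is just an example) so you can inspect it before running it with bash run_extractor.sh; note that $GENOME must be defined in the session that runs the script.

```shell
# Generate one bismark_methylation_extractor call per sample into a script.
# "$GENOME/..." is single-quoted here so it is expanded when the script runs.
for ID in 137160 137487 137488; do
    printf 'bismark_methylation_extractor --cytosine_report WGBS.A34002.%s.chr19.1_bismark_bt2_pe.bam --genome_folder "$GENOME/genome/bismark_index"\n' "$ID"
done > run_extractor.sh
# One command line per sample:
wc -l < run_extractor.sh
```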
Download all the files produced so far to your local computer using your internet browser.
While you wait for all the steps and downloads to finish, you can ask the instructors any questions you might have up until this point.
Load all the downloaded files in IGV using File -> Load from file.
Please be aware that if you try to open all your files in IGV at once, you might get a warning/error message saying that you are trying to open an unsorted BAM. If that happens, ignore the message, but make sure that you open the sorted .bam files so you can visualize your results.
At this point, if you load the region chr19:44,527,387-44,536,873 you should see something like this:
This promoter looks to be hypomethylated.
Can you find a promoter that is hypermethylated?
How about chr19:45,637,715-45,657,380?
How would you look for a CpG island using this view of the data?
The following section will use the Bioconductor package methylKit to do a differential methylation analysis.
You can do it on your own computer (if you have installed R and methylKit) or in the AWS instance.
To install methylKit locally on your computer, make sure you have a recent version of R and follow the instructions on this page.
If you are working in AWS, you will need to load R. The image we provide already has the libraries we need.
To launch R simply type the following to your terminal:
cd ~/workspace/module4
R
If you did this properly, the following message will be displayed and your prompt will change from ubuntu@ip-00-00-00-0:~/workspace/module4$ to >:
```{:output}
R version 4.2.3 (2023-03-15) -- "Shortstop Beagle"
Copyright (C) 2023 The R Foundation for Statistical Computing
Platform: x86_64-pc-linux-gnu (64-bit)
...
```
Once you have successfully launched `R`, you can load `methylKit` with the following command:
```{}
library("methylKit")
```
To read the alignment data into methylKit, run the following command:
methRaw.160 = processBismarkAln( location = "WGBS.A34002.137160.chr19.1_bismark_bt2_pe_sorted.bam",
sample.id="A34002.137160", assembly="hg38",
read.context="CpG", save.folder="methylkit")
This command will import the data into a format that is readable by methylKit. At the same time, it will save two files under the methylkit directory with the information so that it is easy to load again at any time:
methylkit/A34002.137160_CpG_conversionStats.txt
methylkit/A34002.137160_CpG.txt
If everything goes well and you see the files, do the same for the other two samples:
methRaw.487 = processBismarkAln( location = "WGBS.A34002.137487.chr19.1_bismark_bt2_pe_sorted.bam",
sample.id="A34002.137487", assembly="hg38",
read.context="CpG", save.folder="methylkit")
methRaw.488 = processBismarkAln( location = "WGBS.A34002.137488.chr19.1_bismark_bt2_pe_sorted.bam",
sample.id="A34002.137488", assembly="hg38",
read.context="CpG", save.folder="methylkit")
Now that all the samples have been read with methylKit, you can create a file list to make it easier to load the full dataset as a methylkit object. For the purposes of this tutorial, we will consider that samples belong to two experimental groups: A34002.137160 as the control group (treatment = 0) and A34002.137487 & A34002.137488 as the treatment group (treatment = 1). We use the methRead() function to create our object, as shown below:
file.list = list( file.path("methylkit", "A34002.137160_CpG.txt"),
file.path("methylkit", "A34002.137487_CpG.txt"),
file.path("methylkit", "A34002.137488_CpG.txt") )
myobj = methRead(file.list,
sample.id=list("A34002.137160","A34002.137487","A34002.137488"),
assembly="hg38",
treatment=c(0,1,1),
context="CpG",
mincov = 10
)
What do all the options in the methRead() command mean?
file.list points to the location of the input data in methylKit format.
sample.id is a list with the appropriate sample name for each file.
assembly specifies which build of the human reference genome is used.
treatment specifies which experimental group each sample belongs to.
context specifies the methylation context.
mincov specifies the minimum coverage required for a position to be included in the object.
For more details, please refer to the methylKit user guide.
If the files were loaded properly, you can check the object you just created by running the following command:
myobj
Which should output the following message followed by previews of the contents of the object:
methylRawList object with 3 methylRaw objects
...
You can also get basic statistics on your object by using the following command:
getMethylationStats(myobj[[2]],plot=FALSE,both.strands=FALSE)
Before doing any additional analysis, methylKit needs to determine which methylated bases have sufficient coverage in all samples so they can be compared. To do that, the samples should be merged with the unite() function. This function has a parameter destrand= that is turned off by default. We will set the destrand option to TRUE which will merge the coverage of both strands. When doing your own analyses, be aware that for some kinds of methylation analyses (such as CpH methylation) results are strand-specific, so this option should be used carefully.
meth = unite(myobj, destrand=TRUE)
The standard function for differential methylation analysis in methylKit is calculateDiffMeth(). It takes any merged methylKit object as input. Depending on the number of replicates, it uses either Fisher’s exact test or logistic regression to calculate P-values. It also automatically produces Q-values, which are a kind of adjusted P-value. To use it with the results we obtained before, run the following command:
myDiff = calculateDiffMeth(meth)
To check the output, just type myDiff and read the summary. If you want an example of the output, check the solution below.
This is what the output looks like:
```{:output}
methylDiff object with 2941 rows
--------------
    chr    start      end strand       pvalue     qvalue  meth.diff
1 chr19 42002896 42002896      + 3.271268e-01 0.69569299   8.951407
2 chr19 42002978 42002978      + 1.912989e-01 0.60732656 -21.666667
3 chr19 42007251 42007251      + 6.999764e-05 0.03228847 -55.681818
4 chr19 42007255 42007255      + 3.958578e-01 0.75196047 -11.835106
5 chr19 42007283 42007283      + 8.451850e-01 0.91347038  -2.457757
6 chr19 42007314 42007314      + 9.102723e-01 0.92865750  -1.604278
--------------
sample.ids: A34002.137160 A34002.137487 A34002.137488
destranded TRUE
assembly: hg38
context: CpG
treament: 0 1 1
resolution: base
```
To filter results by statistical significance, methylKit provides the getMethylDiff() function, which allows you to extract only the differentially methylated CpGs that meet a specific Q-value threshold. Additionally, it is possible to specify whether to keep only hypo- or hypermethylated CpGs. Finally, the bedgraph() function allows you to save the methylDiff object as a BedGraph file so you can open it in your genome browser of choice. Let’s create two BedGraph files with hypo- and hypermethylated CpGs with a Q-value below 0.05 based on the data above:
myDiff.hyper = getMethylDiff(myDiff,qvalue=0.05,difference=10,type="hyper")
bedgraph(myDiff.hyper, file.name = "hyper.CpG.bedGraph", col.name = "qvalue")
myDiff.hypo = getMethylDiff(myDiff,qvalue=0.05,difference=10,type="hypo")
bedgraph(myDiff.hypo, file.name = "hypo.CpG.bedGraph", col.name = "qvalue")
Two new files should appear now in your workshop folder:
```{:output}
~/workspace/module4/hyper.CpG.bedGraph
~/workspace/module4/hypo.CpG.bedGraph
```
By default, methylKit computes results at individual-CpG resolution. To get Differentially Methylated Regions (DMRs), you have to bin your results first, using a window size of your choice. The function to do this is tileMethylCounts(), which takes a regular methylKit object as input. In this case, we will create 1000 bp bins using the following command:
tiles = tileMethylCounts(myobj,win.size=1000,step.size=1000,cov.bases = 10)
As with CpG level results, samples need to be merged before the analysis can continue:
meth.tiles = unite(tiles, destrand=TRUE)
Now, we will use the calculateDiffMeth() and getMethylDiff() functions to get the DMRs.
Do you know how to do it, based on the information above?
Based on the number of differentially methylated CpGs you found above, do you anticipate many statistically significant DMRs in your analysis?
Use the following commands to perform a DMR
analysis:
myDiff.tiles = calculateDiffMeth(meth.tiles)
myDiff.tiles.hyper = getMethylDiff(myDiff.tiles,qvalue=0.1,difference=10,type="hyper")
bedgraph(myDiff.tiles.hyper, file.name = "hyper.DMR.bedGraph", col.name = "qvalue")
myDiff.tiles.hypo = getMethylDiff(myDiff.tiles,qvalue=0.1,difference=10,type="hypo")
bedgraph(myDiff.tiles.hypo, file.name = "hypo.DMR.bedGraph", col.name = "qvalue")
Using the navigation pane, download the bedGraph files you just produced and try to open them with IGV.
Do the statistical results match what you had seen before when exploring the data?
What interesting genomic features are found close to the DMRs? What could this mean?
The following section will use the Bioconductor package DSS to do a differential methylation analysis.
You can do it on your own computer (if you have installed R and DSS) or in the AWS instance.
To install DSS locally on your computer, make sure you have a recent version of R and follow the instructions on this page.
If you just did the previous section, you might not need to load R again. Otherwise, please launch R using the instructions in section 3.1.
Once you have successfully launched R, you can load DSS with the following command:
library("DSS")
The DSS library expects input data to be already summarized into the following columns for each CG position in the genome: chr (chromosome), pos (position), N (total number of reads covering the position), and X (number of reads showing methylation at the position). Fortunately, Bismark already produced a table (CpG_report.txt) with most of that information when we ran the bismark_methylation_extractor command (see section 2.5 to review this step). Therefore, we can import the CpG_report table into R and reshape it into the proper input for DSS.
First, we will load the CpG_report.txt table for sample 137160 to our R environment and save it as a variable called CpG.report.160. To do this, we will use the base R function read.table and name the columns as follows:
chr, pos, strand, X (number of reads showing methylation at this position), C (number of reads without methylation at this position), C_context (2-base context at this position), and tri_context (3-base context at this position). The full command is as follows:
CpG.report.160 <- read.table("WGBS.A34002.137160.chr19.1_bismark_bt2_pe.CpG_report.txt", header = F, col.names = c("chr", "pos", "strand", "X", "C", "C_context", "tri_context"))
Next, we need to calculate the total number of reads for each position by adding up the ones with and without methylation (columns X and C in the table we just imported). We will name this new column N to follow the DSS convention. Finally, we will save the DSS input table as CpG.DSS.table.160, keeping only the chr, pos, N, and X columns. You can find the full commands below:
CpG.report.160["N"] <- CpG.report.160["C"] + CpG.report.160["X"]
CpG.DSS.table.160 <- CpG.report.160[c("chr", "pos", "N", "X")]
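If you prefer to do this reshaping outside R, the same N = C + X computation can be done with awk directly on a Bismark CpG report. Below is a sketch on a tiny made-up two-line report; the demo_CpG_report.txt file name and its values are illustrative only, so point the awk line at your real *_pe.CpG_report.txt instead.

```shell
# Fake two-line Bismark-style CpG report: chr, pos, strand,
# methylated count, unmethylated count, context, trinucleotide context.
printf 'chr19\t60119\t+\t3\t7\tCG\tCGA\nchr19\t60120\t-\t0\t0\tCG\tCGT\n' > demo_CpG_report.txt
# Keep chr and pos, emit N = methylated + unmethylated, then X = methylated.
awk 'BEGIN{OFS="\t"} {print $1, $2, $4+$5, $4}' demo_CpG_report.txt > demo_DSS_input.txt
cat demo_DSS_input.txt
```

The result has the chr, pos, N, X columns DSS expects (here the first line becomes chr19, 60119, 10, 3).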
Can you repeat the above process for the other two samples?
Use the following commands to import the remaining CpG_report tables and reformat them into the appropriate DSS input.
CpG.report.487 <- read.table("WGBS.A34002.137487.chr19.1_bismark_bt2_pe.CpG_report.txt", header = F, col.names = c("chr", "pos", "strand", "X", "C", "C_context", "tri_context"))
CpG.report.487["N"] <- CpG.report.487["C"] + CpG.report.487["X"]
CpG.DSS.table.487 <- CpG.report.487[c("chr", "pos", "N", "X")]
CpG.report.488 <- read.table("WGBS.A34002.137488.chr19.1_bismark_bt2_pe.CpG_report.txt", header = F, col.names = c("chr", "pos", "strand", "X", "C", "C_context", "tri_context"))
CpG.report.488["N"] <- CpG.report.488["C"] + CpG.report.488["X"]
CpG.DSS.table.488 <- CpG.report.488[c("chr", "pos", "N", "X")]
Hint: If you want to check all the DSS input
tables are constructed properly, you can take a peek at the “top” of
each table with the head() command in R. For
example, you should see the following if you did all the previous steps
correctly:
head(CpG.DSS.table.160)
```{:output}
    chr   pos N X
1 chr19 60119 0 0
2 chr19 60120 0 0
3 chr19 60172 0 0
4 chr19 60173 0 0
5 chr19 60183 0 0
6 chr19 60184 0 0
```
The same command for the other two samples will give the same result, because there are no reads aligning to the beginning of chromosome 19 in our dataset.
Next, we will create the BSseq object that DSS will use, with the makeBSseqData command and the tables we just created:
BSobj <- makeBSseqData( list(CpG.DSS.table.160, CpG.DSS.table.487, CpG.DSS.table.488),
c("sample.160","sample.487", "sample.488") )
Notice how we are labeling the 3 samples with the names c("sample.160","sample.487", "sample.488") as part of this command. It is likely that you will receive a warning message when you run the command above, indicating that the CG sites are not ordered. We will ignore it for now; in this case it should not impact the analysis.
If the object was constructed properly, we can inspect its properties by calling the variable name directly (BSobj), which will output:
```{:output}
An object of type 'BSseq' with
  2113330 methylation loci
  3 samples
has not been smoothed
All assays are in-memory
```
The DSS library operates under different statistical assumptions than methylKit, including smoothing of the methylation data to improve the analysis. By smoothing the data, DSS will attempt to correct the methylation percentage of a CpG site using the context of nearby sites. It will then estimate the dispersion of the sites and perform a Wald test to allow for comparison across samples. These three steps are done in one single command called DMLtest, which takes our BSobj as input.
We run the command as follows and save the results in a variable called dmlTest:
dmlTest <- DMLtest(BSobj,
group1=c("sample.160"),
group2=c("sample.487", "sample.488"),
smoothing=TRUE)
Notice how we defined the two experimental groups in the DMLtest command, using the sample names we defined when creating BSobj. When the test finishes running, we can extract the significant Differentially Methylated Loci (DML) using the callDML command and our chosen p-value threshold:
dmls = callDML(dmlTest, p.threshold=0.05)
Extracting Differentially Methylated Regions (DMR) works similarly, with the callDMR command. One important difference between methylKit and DSS is that the latter does not work using bins. Instead, DSS will estimate the length of a DMR based on the dispersion of individual CpGs. Therefore, contrary to DMRs calculated with methylKit, these regions will all have different lengths.
To ensure that the DMRs we call in this workshop are comparable to what we calculated with methylKit, we will set the minimum length to 500 bp. The callDMR command will look like this:
dmrs = callDMR(dmlTest, minlen = 500, p.threshold=0.05)
The DSS library groups all DMR results into a single table, making no distinction between hypo- and hypermethylated regions. You can view the results by calling the dmrs variable.
Look at the DSS differential methylation results and compare them to what was obtained by methylKit. Do you notice any important differences? Are there any overlaps?
To save the results of your DSS analysis as a CSV file, use the base R command write.csv. Unfortunately, there is no native DSS command to export the results to a BedGraph file, and converting the table output to this format is outside the scope of this workshop. We encourage you to look for tools that do this after the workshop if you want a deeper comparison between the DSS and methylKit results on your own datasets.
write.csv(dmls, file="DSS.DML.table.csv")
write.csv(dmrs, file="DSS.DMR.table.csv")
After saving the DSS results as CSV, remember to download them to your computer, where you can open them in a file explorer or even a spreadsheet application for easier visualization and filtering.
You can quit R using the quit() or q() command. Remember to stop your AWS instance after this lab to avoid unnecessary costs.
Once you are finished, make sure you download all the files you need and continue exploring in IGV.
Lab Completed!
Congratulations! You have completed Lab 4!
We will now explore some of the tools that were covered in the lecture for module 5.
In this lab, we will:
From your command line terminal, go to your workspace folder.
cd ~/workspace
If needed, delete any existing module5 directory in your home directory with the rm command, then create a new module5 directory:
rm -rf module5
mkdir module5
cd module5
Go back to the portal home (Home on the top menu), and select the following categories of datasets: on the “hg38” reference genome, “Histone” experiments for the “Blood” cell type. Click on View selected.
You can get a whole-genome overview of the similarity of a group of tracks by using the Portal’s correlation tool.
At the bottom of the grid, click on the button “Correlate datasets”
You will see that similar tracks (e.g. those of the same assay) seem to correlate nicely. You can zoom the view with the mouse scroll wheel, or with the buttons at the lower right corner of the popup.
You should get something like this:
We will now attempt to detect motifs at transcription factor binding sites in peak regions using HOMER.
Choose Assembly hg19.
Select the transcription factor binding site (TFBS) assays, then the CTCF assay and the “B cell” cell type. We will use this file: https://epigenomesportal.ca/tracks/ENCODE/hg19/84840.ENCODE.ENCBS400ARI.CTCF.peak_calls.bigBed. It contains peaks that were called from the TFBS ChIP-Seq experiment https://www.encodeproject.org/biosamples/ENCBS400ARI/.
Create a working directory:
mkdir homer
cd homer
wget https://epigenomesportal.ca/tracks/ENCODE/hg19/84840.ENCODE.ENCBS400ARI.CTCF.peak_calls.bigBed
We will use the UCSC tool bigBedToBed, which converts a binary bigBed file to its plain-text equivalent. Some of these tools have been pre-installed on your AWS image.
bigBedToBed 84840.ENCODE.ENCBS400ARI.CTCF.peak_calls.bigBed 84840.ENCODE.ENCBS400ARI.CTCF.peak_calls.bed
mkdir output
mkdir preparsed
findMotifsGenome.pl 84840.ENCODE.ENCBS400ARI.CTCF.peak_calls.bed hg19 output -preparsedDir preparsed -p 4 -S 15
Next, we will try to identify GO terms connected to ChIP-Seq peak calls using GREAT. We need BED files to use the GREAT portal, so we will do the conversion from a bigBed file to a BED file in our AWS session.
Select Human (hg38), filter the tissues list to keep only “Bone Marrow” tissues, and select the datasets for sample ERS1027405. Click “Download tracks” at the bottom of the grid.
On the download page, click on View Full URL List. This will give you a text list with all tracks of interest. Copy the link to this page to your clipboard, using the address shown in your browser’s URL bar.
Open another terminal session to get into AWS
Go to your module5 directory and create a place to put the material we will download
cd ~/workspace/module5
mkdir great
cd great
wget -O trackList.txt 'https://epigenomesportal.ca/api/datahub/download?session=18731&format=text'
wget -i trackList.txt
bigBedToBed 58394.Blueprint.ERS1027405.H3K27ac.peak_calls.bigBed 58394.Blueprint.ERS1027405.H3K27ac.peak_calls.bed
Note
If you’re on Linux / Mac, you can also install the UCSC tools locally; they are a useful set of tools for manipulating track data and do not require much processing power.
We will shuffle the peaks randomly with sort -R, then keep the first 20000 with head -n 20000. First, shuffle the file:
sort -R 58394.Blueprint.ERS1027405.H3K27ac.peak_calls.bed > 58394.Blueprint.ERS1027405.H3K27ac.peak_calls.random.bed
head -n 20000 58394.Blueprint.ERS1027405.H3K27ac.peak_calls.random.bed > 58394.Blueprint.ERS1027405.H3K27ac.peak_calls.random_short.bed
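On systems with GNU coreutils, the shuffle-then-truncate pair can be collapsed into a single shuf call. A sketch on a throwaway 10-line file (the demo_peaks.bed name and -n 3 are illustrative; for the lab, swap in the real peak_calls.bed and -n 20000):

```shell
# shuf -n N picks N random lines in one step, like sort -R | head -n N.
seq 1 10 > demo_peaks.bed
shuf -n 3 demo_peaks.bed > demo_peaks.random_short.bed
# Confirm the subsample size:
wc -l < demo_peaks.random_short.bed
```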
Download 58394.Blueprint.ERS1027405.H3K27ac.peak_calls.random_short.bed locally using your browser:
http://<your VM ip address>/module5/great/58394.Blueprint.ERS1027405.H3K27ac.peak_calls.random_short.bed
Load the GREAT website: http://bejerano.stanford.edu/great/public/html/
Provide the following input to the GREAT interface:
Submit the form
In the results, for instance, you should obtain something like this for biological processes:
Bonus question: Why is your result slightly different from the screenshot?
cd ~/workspace/module5/homer/
zip -r homer.zip output
http://<your VM ip address>/module5/homer/homer.zip
If you have time remaining, you can further explore the tools we covered in this lab using other types of datasets. For example, does running a GREAT query on another cell type yield the type of annotations that you’d expect?
Interested in exploring Galaxy? You can start tackling a Galaxy introduction extra lab, available here. It uses the usegalaxy.org server, so you can also follow it at your own pace after the workshop.
Lab Completed!
Congratulations! You have completed Lab 5!